
Monitoring High-Dimensional Data for Failure Detection and Localization in Large-Scale Computing Systems

Haifeng Chen, Guofei Jiang, and Kenji Yoshihira

IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 1, January 2008.

The authors are with NEC Laboratories America, Inc., 4 Independence Way, Suite 200, Princeton, NJ 08540. E-mail: {haifeng, gfj, kenji}@nec-labs.com.

Manuscript received 27 Nov. 2006; revised 4 July 2007; accepted 5 Sept. 2007; published online 12 Sept. 2007. For information on obtaining reprints of this article, please send e-mail to [email protected], and reference IEEECS Log Number TKDE-0537-1106. Digital Object Identifier no. 10.1109/TKDE.2007.190674. 1041-4347/08/$25.00 © 2008 IEEE. Published by the IEEE Computer Society.

Abstract—It is a major challenge to process high-dimensional measurements for failure detection and localization in large-scale computing systems. However, it is observed that in information systems, those measurements are usually located in a low-dimensional structure that is embedded in the high-dimensional space. From this perspective, a novel approach is proposed to model the geometry of underlying data generation and detect anomalies based on that model. We consider both linear and nonlinear data generation models. Two statistics, that is, the Hotelling $T^2$ and the squared prediction error (SPE), are used to reflect data variations within and outside the model. We track the probabilistic density of the extracted statistics to monitor the system's health. After a failure has been detected, a localization process is also proposed to find the most suspicious attributes related to the failure. Experimental results on both synthetic data and a real e-commerce application demonstrate the effectiveness of our approach in detecting and localizing failures in computing systems.

Index Terms—Failure detection, manifold learning, statistics, data mining, information system, Internet applications.

    1 INTRODUCTION

Detecting failures promptly in large-scale Internet services is becoming more and more critical. A single hour of downtime of services such as Google, MSN, and Yahoo! can result in millions of dollars of lost revenue, bad publicity, and click-over to competitors. One significant problem in building such detection tools is the high dimensionality of the measurements collected from large-scale computing infrastructures. For example, commercial frameworks such as HP's OpenView [10] and IBM's Tivoli [19] aggregate attributes from a variety of sources, including hardware, networking, operating systems, and application servers. It is hard to extract meaningful information from those data to distinguish anomalous situations from normal ones.

In many circumstances, however, the system measurements are not truly high dimensional. Rather, they can be efficiently summarized in a space of much lower dimension, because many of the attributes are correlated. For instance, a multitier e-commerce system may receive a large number of user requests every day, and many internal attributes react to the volume of user requests as the requests flow through the system [22]. Such internal correlations among system attributes motivate us to develop a new approach to monitoring high-dimensional data for system failure detection. We discover the underlying low-dimensional structure of the monitoring data and extract two statistics, the Hotelling $T^2$ score and the squared prediction error (SPE), from each measurement to express its variations within and outside the discovered model. Failure detection is then carried out by tracking the probabilistic density of these statistics along time. Each time a new measurement comes in, we calculate its related statistics and then update their probabilistic density based on the newly computed values. A large deviation of the density distribution before and after the update is regarded as an indication of system failure.

We start with the situation where the monitoring data are generated from a low-dimensional linear structure. Singular value decomposition (SVD) is employed to discover the linear subspace that contains the majority of the data. The Hotelling $T^2$ and SPE are then derived from the geometric features of each measurement with respect to that subspace. After that, we extend our work to the case where the underlying data structure is nonlinear, which is often encountered in information systems due to nonlinear mechanisms such as caching, queuing, and resource pooling. Unlike the linear case, however, there are many new challenges in extracting geometric features from nonlinear data. For example, there are no parametric equations that globally describe the nonlinear model. Our approach is based on the assumption that the data lies on a nonlinear (Riemannian) manifold. That is, even though the measurements are globally nonlinear, they are often smooth and approximately linear in a local region. In the last few years, many manifold reconstruction algorithms have been proposed: locally linear embedding (LLE) [32], isometric feature mapping (ISOMAP) [35], and so on. However, these algorithms all focus on the problem of dimension reduction, which determines the low-dimensional embedding vector $y_i \in \mathbb{R}^r$ of the original measurement $x_i \in \mathbb{R}^p$. In this paper, in order to derive the nonlinear version of the Hotelling $T^2$ and SPE, we need more geometric information about the data




distribution aside from their low-dimensional representations $y_i$. For instance, the projection $\bar{x}_i$ of each measurement $x_i$ on the underlying manifold in the original space is required to compute the SPE value of $x_i$. Furthermore, it has been noted that current manifold reconstruction algorithms are sensitive to noise and outliers [4], [7]. To solve these issues, we propose a novel approach to discovering the underlying geometry of the nonlinear manifold for the purpose of failure detection. We first use the linear error-in-variables (EIV) model [14] in each local region to estimate the locally smoothed values of each point in that region. Since the local regions overlap, each measurement may have several locally smoothed values. We then propose a fusion process to combine those locally smoothed values and obtain a globally smoothed value for each measurement. Those globally smoothed values are regarded as the projections of the original measurements on the underlying manifold and are then fed to current manifold learning algorithms such as LLE to obtain their low-dimensional representations. Note that instead of directly using the original data for manifold reconstruction, we propose the EIV model and a fusion process to preprocess the data and hence achieve a more robust and accurate reconstruction of the underlying manifold. As a by-product, we also obtain the projections of the original measurements on the underlying manifold in the original space, which are necessary for computing the SPE value of each measurement.

We also present a statistical test algorithm to decide whether the linear or the nonlinear model is suitable, given the measurement data. After modeling, we compute the values of the two statistics for each measurement and estimate their probabilistic density. Failure detection is based on the deviation of the newly calculated statistics of each incoming measurement with respect to the learned density. Once a failure is detected, a localization procedure reveals the most suspicious attributes based on the values of the violated statistics. We use both synthetic data and measurements from a real e-commerce application to test the effectiveness of our proposed approach. The purpose of using synthetic data is to demonstrate the advantages of the EIV model and fusion in reconstructing the manifold. We compare the results of the LLE algorithm on the original measurements and on data preprocessed by the EIV model and fusion. The results show that both proposed procedures are necessary to achieve an accurate reconstruction of the nonlinear manifold. Then, we test our failure detection and localization methods in a J2EE-based Web application. We collect measurements during normal system operations and apply both the linear and nonlinear algorithms to learn the underlying data structure. We then inject a variety of failures into the system and compare the performance of failure detectors based on the linear and nonlinear models. Both models can detect many failure incidents; however, the nonlinear model produces more accurate results than the linear model.

    2 BACKGROUND AND RELATED WORK

The purpose of this paper is to model the normal behavior of a system and highlight any significant divergence from normality to indicate the onset of unknown failures. In data mining, this is called anomaly detection or novelty detection, and there is a large body of literature on the topic. Markou and Singh provided a detailed review of those techniques [27], [28], covering both statistical and neural network-based approaches. The statistical approaches [26], [11] model the data based on their statistical properties and use such information to test whether new samples come from the same distribution, whereas the neural network-based methods [25], [8] train a network to implicitly reveal the unknown data distribution for detecting novelties.

However, much of the early work makes implicit assumptions about the low dimensionality of the data and does not work well for high-dimensional measurements. For example, it is quite difficult for statistical methods to model the density of data with hundreds or thousands of attributes. The computational complexity of neural networks is also an important consideration for high-dimensional data. To address these issues, Aggarwal and Yu [1] proposed a projection-based method to find the best subsets of attributes that can reveal data anomalies. Bodik et al. [6] used a Naive Bayes approach, assuming an independent distribution of attributes, to model the probabilistic density of high-dimensional data. Tax and Duin [34] proposed a support vector-based approach to identify a minimal hypersphere that surrounds the normal data; samples located outside the hypersphere are considered faulty measurements. In this paper, our solution is based on the observation that in information systems, the high-dimensional data are usually located on a low-dimensional underlying structure that is embedded in the high-dimensional space. Unlike the Naive Bayes approach, we discover the correlations among data attributes and extract the low-dimensional structure that generates the data. Furthermore, our approach is carried out in the original data space, without any mapping of the data into a kernel feature space, as in the support vector-based methods. As a consequence, we can directly analyze the suspicious attributes once a failure has been detected.

Detecting and localizing failures promptly is crucial for mission-critical information systems. However, some specific features of those systems introduce challenges for the detection task. For instance, a large percentage of actual failures in computing systems are partial failures, which break down only part of the service functions and do not affect operational statistics such as response time. Such partial failures cannot easily be detected by traditional tools such as pings and heartbeats [2]. To solve this issue, statistical learning approaches have recently received a lot of attention due to their capability of mining large quantities of measurements for interesting patterns that can be directly related to high-level system behavior. For instance, Ide and Kashima [20] treated the Web-based system as a weighted graph and applied graph mining techniques to monitor graph sequences for failure detection. In the Magpie project [5], Barham et al. used a stochastic context-free grammar to model the requests' control flow across multiple machines for detecting component failures and localizing performance bottlenecks. The Pinpoint project [9], a close relative of Magpie, proposed two features for system failure


detection: request path shapes and component interactions. For the former, the set of observed traces is modeled with a probabilistic context-free grammar (PCFG). The latter feature is used to build a profile of each component's interactions, and the $\chi^2$ test compares the current distribution with the profile. In the same context of request shape analysis as in [9], Jiang et al. [21] put forward a multiresolution abnormal trace detection algorithm using variable-length n-grams and automata. Bodik et al. [6] made use of user access behavior as evidence of system health and applied several statistical approaches, such as Naive Bayes, to mine such information for detecting application-level failures.

We notice that in most circumstances, the attributes collected from information systems are highly correlated. For instance, the interaction profile of one component, as described in [9], is usually correlated with those of other components due to implicit business logic or other system constraints. Similarly, there exist high correlations between different user Web access behaviors [6]. From this perspective, we believe that the high-dimensional measurements can be summarized in a space with much lower dimension. We explore such underlying low-dimensional structure to detect system failures. Based on the learned structure of the data, the values of the Hotelling $T^2$ and SPE are calculated for each sample during online monitoring. Failures are then detected based on the deviation of those statistics with respect to their distributions. Note that the Hotelling $T^2$ and SPE have already been used in chemometrics to understand chemical systems and processes [30], [33], [24]. For example, Soft Independent Modeling of Class Analogy (SIMCA), a widely used approach in chemometrics, employs the Hotelling $T^2$ and SPE to identify classes of system measurements with similar hidden phenomena [33]. Kourti and MacGregor [24] presented these statistics to monitor chemical processes and diagnose performance anomalies. However, those methods all assume that the data are located on a hyperplane in the original space and rely on linear methods such as principal component analysis (PCA) [23] or partial least squares (PLS) [18] to compute the values of the Hotelling $T^2$ and SPE. In this paper, we provide a general framework for calculating these statistics, which considers both linear and nonlinear data generation models. For the nonlinear model, we propose a novel algorithm to extract the Hotelling $T^2$ and SPE from the underlying manifold learned from the training data. Furthermore, we have applied the Hotelling $T^2$ and SPE statistics to failure detection in distributed computing systems. Experimental results show that these two statistics are effective in detecting a variety of injected failures in our Web-based test system.

    3 MODELING THE NORMAL DATA

When system measurements contain hundreds of attributes, our solution for modeling their normal behavior is based on the fact that actual measurements are usually generated from a structure with much lower dimension. Fig. 1 uses 2D examples to illustrate such situations. In Fig. 1a, the normal data (marked as 'x') are generated from a 1D line with certain noise in the 2D space. Similarly, Fig. 1b shows data generated from an underlying 1D nonlinear manifold. Two abnormal measurements are also plotted in each figure (marked as 'o'). As we can see, the abnormal sample $o_1$ deviates from the underlying structure. Although the abnormal sample $o_2$ may be located on the structure, its position in the structure is too far from those of the normal measurements.

It is observed in Fig. 1 that the implicit structure captures important information about the data distribution. Such a property can be exploited to distinguish abnormal from normal samples. That is, we discover the underlying geometric model and use the data variations within and outside the model as features to detect failures. Two statistics, the Hotelling $T^2$ and SPE, are utilized to represent such data variations. We calculate these two statistics for each measurement and then build their probabilistic distribution based on the computed values. In the monitoring process, we compute the statistics of new measurements and check their values with respect to the learned distribution to detect failures. Fig. 2 shows the workflow of our normal data modeling. We consider both linear and nonlinear data generation models, described in Sections 3.1 and 3.2, respectively. Section 3.3 then presents a criterion to determine whether the linear or the nonlinear model is suitable for the available data.

    3.1 Linear Model

If the measurements $x_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, are generated from a low-dimensional hyperplane in $\mathbb{R}^p$, we apply the SVD of the data matrix $X = [x_1 \cdots x_n]$:

$$X = U \Sigma V^\top = U_s \Sigma_s V_s^\top + U_n \Sigma_n V_n^\top, \qquad (1)$$

where

$$\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r, \sigma_{r+1}, \ldots, \sigma_\ell) \in \mathbb{R}^{p \times n}$$

and $\sigma_1 \geq \cdots \geq \sigma_r \gg \sigma_{r+1} \geq \cdots \geq \sigma_\ell$, with $\ell = \min\{p, n\}$.


Fig. 1. Two-dimensional data examples. (a) Linear case. (b) Nonlinear case.

Fig. 2. The workflow of normal data modeling.


The two orthogonal matrices $U$ and $V$ are called the left and right eigenmatrices of $X$. Based on the magnitude of the singular values $\sigma_i$, the space of $X$ is decomposed into signal and noise subspaces. The left $r$ columns of $U$, $U_s = [u_1, u_2, \ldots, u_r]$, form the bases of the signal subspace, and $\Sigma_s = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$. Any vector $x \in \mathbb{R}^p$ can be represented by the summation of two projection vectors from the two subspaces:

$$x = \bar{x} + \tilde{x}, \qquad (2)$$

where $\bar{x} = U_s U_s^\top x$ is the projection of $x$ on the hyperplane expressed in the original space, which represents the signal part of $x$, and $\tilde{x}$ contains the noise. Meanwhile, we can also obtain the low-dimensional representation of $x$ in the signal subspace:

$$y = U_s^\top x. \qquad (3)$$

The vector $y \in \mathbb{R}^r$ is called the principal component vector of $x$; it represents the $r$-dimensional coordinates of $x$ in the signal subspace, and the $i$th element of $y$ is called the $i$th principal component of $x$. Since the principal components are uncorrelated and the variance of the $i$th component $y_i$ is $\sigma_i^2$ [16], the covariance of $y$ is $C_y = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_r^2)$. Note that for ease of explanation, we assume that the data $x_i$ are centered. In real situations, we need to center the data before the above calculations.

Two statistics are defined for each sample $x$ to represent its variations within and outside the signal subspace. One is the Hotelling $T^2$ score [23], which is expressed as the Mahalanobis distance from $x$'s principal components $y$ to the mean $\bar{y}$ of the principal component vectors from the training data:

$$T^2 = (y - \bar{y})^\top C_y^{-1} (y - \bar{y}), \qquad (4)$$

where $C_y^{-1}$ is the inverse matrix of $C_y$. Since $\bar{y} = 0$, we can simplify (4) as $T^2 = y^\top C_y^{-1} y$. The other statistic, the SPE [23], indicates how well the sample $x$ conforms to the hyperplane, measured by the Euclidean distance between $x$ and its projection $\bar{x}$ on the hyperplane expressed in the original space:

$$\mathrm{SPE} = \|\tilde{x}\|^2 = \|x - \bar{x}\|^2. \qquad (5)$$

The intuition of using these two statistics for failure detection is illustrated in Fig. 3, in which 2D normal samples, marked as 'x', are generated from a line (a 1D subspace) with certain noise. Through subspace decomposition, we obtain the direction of the line (the signal subspace) and then project each sample $x$ onto that line to get $\bar{x}$. In this case, the Hotelling $T^2$ score represents the Mahalanobis distance from $\bar{x}$ to the origin (0, 0), and the value of the SPE is the squared distance between $x$ and $\bar{x}$. Two abnormal samples, marked as 'o', are also shown in Fig. 3a. Since the sample $o_1$ is far from the line, its SPE is much larger than those of the other points. Although the sample $o_2$ has a reasonable SPE value, its Hotelling $T^2$ score is very large, since its projection on the line, $o_2'$, is far from the cluster of projected normal samples. We plot the histograms of the Hotelling $T^2$ scores and SPE values of all the samples in Figs. 3b and 3c, respectively. Based on these histograms, we conclude that by defining suitable boundaries for normal samples on the extracted statistics, we can find abnormal measurements and hence detect failures.
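To make the linear model concrete, the following Python sketch (an illustration, not code from the paper) computes the two statistics via the SVD of (1): the signal basis $U_s$ gives the principal components (3), the projection (2), the Hotelling $T^2$ of (4) with zero mean, and the SPE of (5). It assumes the columns of the data matrix have already been centered and follows the paper's convention that the variance of the $i$th principal component is $\sigma_i^2$.

```python
import numpy as np

def fit_linear_model(X, r):
    """Fit the linear model of Section 3.1 by SVD.

    X : (p, n) matrix of centered measurements, one sample per column.
    r : dimension of the signal subspace.
    """
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    U_s = U[:, :r]     # bases of the signal subspace, eq. (1)
    var = s[:r] ** 2   # per-component variances (paper's convention; a
                       # scaled variant such as s^2/(n-1) only rescales T^2)
    return U_s, var

def t2_spe(x, U_s, var):
    """Hotelling T^2 (eq. 4 with zero mean) and SPE (eq. 5) of one sample."""
    y = U_s.T @ x      # low-dimensional representation, eq. (3)
    x_bar = U_s @ y    # projection on the hyperplane, eq. (2)
    t2 = float(np.sum(y ** 2 / var))
    spe = float(np.sum((x - x_bar) ** 2))
    return t2, spe
```

In practice, the subspace dimension $r$ would be chosen from the decay of the singular values, as described above.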

    3.2 Nonlinear Model

When the measurements $x_i$ are generated from a nonlinear structure, we still desire statistics that serve the same purpose as the Hotelling $T^2$ and SPE in the linear model. To derive the nonlinear version of these two statistics, our approach is based on the assumption that the data lies on a nonlinear (Riemannian) manifold. We discover the underlying manifold of the high-dimensional measurements and then define the corresponding statistics based on the geometric features of each sample with respect to the discovered manifold.

According to the original definitions of the Hotelling $T^2$ in (4) and the SPE in (5), we need the following information to obtain their nonlinear estimates:

- the low-dimensional embedding vector $y$ of the original measurement $x$, where $y$ represents the low-dimensional coordinates of $x$ in the underlying manifold instead of the linear hyperplane, and
- the projection $\bar{x}$ of the measurement $x$ on the manifold in the original space.

If we have the values of these variables, the nonlinear version of the Hotelling $T^2$ and SPE for each sample $x$ can be defined in the same form as in the linear situation. For instance, the nonlinear $T^2$ is expressed as in (4), except that $y$ is computed from the underlying manifold, $\bar{y}$ is the mean of $y$ over all the training data, and $C_y$ denotes the sample covariance matrix:

$$C_y = \frac{1}{n-1} \sum_{j=1}^{n} (y_j - \bar{y})(y_j - \bar{y})^\top. \qquad (6)$$


Fig. 3. The role of the extracted statistics for failure detection. (a) The 2D normal data together with two outliers. (b) The histogram of the Hotelling $T^2$ of all the samples. (c) The histogram of the SPE of all the samples.


Similarly, the nonlinear SPE for sample $x$ is also defined as in (5), with $\bar{x}$ representing the projection of $x$ on the manifold.

In the last few years, many manifold reconstruction algorithms have been proposed, such as LLE [32] and ISOMAP [35]. However, these algorithms all focus on the problem of dimension reduction, which only outputs the low-dimensional embedding vector $y$ of the original measurement $x$ and does not directly compute the projection $\bar{x}$ of $x$ on the manifold. Furthermore, in practical situations, the measurements are always noisy, and it has been noted that the LLE and ISOMAP algorithms are sensitive to noise [4], [7]. As a consequence, we use two steps to obtain the necessary estimates:

$$x_i \;\xrightarrow{(1)}\; \bar{x}_i \;\xrightarrow{(2)}\; y_i, \qquad i = 1, \ldots, n. \qquad (7)$$

We start by calculating the projection $\bar{x}_i$ of each sample $x_i$ on the underlying manifold, followed by the estimation of the $y_i$'s based on the projected values. Note that here, we use the estimate $\bar{x}_i$ as an intermediary for obtaining $y_i$. The accuracy of the estimate $\bar{x}_i$ is therefore important to the computation of the two statistics. To get reliable $\bar{x}_i$'s, we first apply the linear EIV model, as described in Section 3.2.1, in the neighborhood region of each sample $x_i$ to compute the locally smoothed value of every sample in that neighborhood. Since the local regions overlap, each sample usually has several locally smoothed values. We then present a fusion process in Section 3.2.2 to combine all the locally smoothed values of $x_i$ and hence obtain its projection $\bar{x}_i$ on the manifold in the original space. Once the projections $\bar{x}_i$ are available, they are input to current manifold learning algorithms such as LLE to estimate the low-dimensional embedding vectors $y_i$, as described in Section 3.2.3.

    3.2.1 Local Error-in-Variables (EIV) Model

We start with the $k$-nearest-neighbor search for each sample $x_i$ to define its local region. Given a set of $p$-dimensional vectors $x_{i_1}, \ldots, x_{i_k}$ located in the neighborhood of $x_i$, a local smoothing of those vectors is performed based on the geometry of the local region that comprises the points. For simplicity, we use $x_1, \ldots, x_k$ to denote the neighborhood points of $x_i$. In current manifold reconstruction algorithms such as LLE, the local geometry is represented by a weight vector $w = [w_{i1} \cdots w_{ik}]^\top$ that best reconstructs $x_i$ from its neighbors. By minimizing the reconstruction error

$$\Big\| x_i - \sum_{j=1}^{k} w_{ij} x_j \Big\|^2 = \sum_{l=1}^{p} \Big( x_{il} - \sum_{j=1}^{k} w_{ij} x_{jl} \Big)^2 \qquad (8)$$

subject to $\sum_j w_{ij} = 1$, where $x_{il}$ (or $x_{jl}$) represents the $l$th element of the vector $x_i$ (or $x_j$), LLE obtains the least squares solution of $w$ for the local region surrounding $x_i$. Equation (8) assumes that all the neighbors of $x_i$ are free of noise and that only the measurement $x_i$ is noisy. This is frequently unrealistic, since usually all the samples are corrupted by noise. As a result, the solution of (8) is biased. To remedy this problem, we use the following EIV model [14] by minimizing

$$J = \Big\| x_i - \sum_{j=1}^{k} w_{ij} \bar{x}_j \Big\|^2 + \sum_{j=1}^{k} \| x_j - \bar{x}_j \|^2 \qquad (9)$$

subject to $\sum_j w_{ij} = 1$, where $\bar{x}_j$ is the local noise-free estimate of sample $x_j$ in the region surrounding $x_i$. Equation (9) can also be represented as

$$J = \| x_i - \bar{x}_i \|^2 + \sum_{j=1}^{k} \| x_j - \bar{x}_j \|^2 \qquad (10)$$

by taking into account that $\bar{x}_i = \sum_j w_{ij} \bar{x}_j$. Define the matrices $B = [x_i \; x_1 \; x_2 \cdots x_k] \in \mathbb{R}^{p \times (k+1)}$ and $\bar{E} = [\bar{x}_i \; \bar{x}_1 \; \bar{x}_2 \cdots \bar{x}_k] \in \mathbb{R}^{p \times (k+1)}$ to contain the locally smoothed estimates of those points, with $p \gg k$ for the common case where the nonlinear manifold is embedded in a high-dimensional space. The problem (10) can be reformulated as

$$\min \| B - \bar{E} \|^2 \qquad (11)$$

subject to

$$\bar{E} \alpha = 0, \qquad (12)$$

where $\alpha = [1 \; -w_{i1} \; -w_{i2} \cdots -w_{ik}]^\top$. From (12), the rank of $\bar{E}$ is $k$. Therefore, the estimate $\bar{E}$ is the rank-$k$ approximation of the matrix $B$. If the SVD of $B$ is $B = \sum_{j=1}^{k+1} \sigma_j u_j v_j^\top$, with $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{k+1}$, we obtain the noise-free sample matrix $\bar{E}$ from the Eckart-Young-Mirsky theorem [12], [29] as

$$\bar{E} = \sum_{j=1}^{k} \sigma_j u_j v_j^\top. \qquad (13)$$

The weight vector $w$ can also be estimated from the SVD of the matrix $B$. Since our purpose here is to find the local noise-free estimate $\bar{E}$, we do not discuss it in detail; for an in-depth treatment of the EIV model and its solutions, see [14], [36]. According to [36], we can also obtain a first-order approximation of the covariance $C_w$ of the parameter $w$, which is proportional to the estimate of the noise variance $\hat{\sigma}^2 = \sigma_{k+1}^2 / (p - k)$.
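A minimal sketch of this local smoothing step, under the paper's assumption $p \gg k$: the columns of the input matrix are $x_i$ and its $k$ neighbors (the matrix $B$ of (11)), and the rank-$k$ SVD truncation of (13) returns their locally smoothed values, together with the first-order noise-variance estimate $\hat{\sigma}^2 = \sigma_{k+1}^2/(p-k)$ quoted from [36].

```python
import numpy as np

def local_eiv_smooth(B):
    """Locally smooth one neighborhood with the EIV model of Section 3.2.1.

    B : (p, k+1) matrix whose columns are x_i and its k nearest neighbors
        (the matrix B of eq. 11); assumes p >= k + 1.
    Returns E_bar, the rank-k approximation of B (eq. 13), whose columns
    are the locally smoothed values (x_i's own estimate first), and the
    estimated noise variance sigma_{k+1}^2 / (p - k).
    """
    p, k1 = B.shape
    k = k1 - 1
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    # Eckart-Young-Mirsky: drop the smallest singular value to enforce rank k
    E_bar = (U[:, :k] * s[:k]) @ Vt[:k, :]
    noise_var = s[k] ** 2 / (p - k)
    return E_bar, noise_var
```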

    3.2.2 Fusion

We apply the EIV model in the neighborhood region of every sample $x_i$ to obtain the locally smoothed value of every point in that region. Since each sample $x_i$ is usually included in the neighborhoods of several other points as well as in its own local region, it has more than one local noise-free estimate. Given those different estimates $\{\bar{x}_i^1, \bar{x}_i^2, \ldots, \bar{x}_i^h\}$, with $h \geq 1$, our goal is to find a global noise-free estimate $\bar{x}_i$ of $x_i$ from its many local values. Such a global noise-free estimate $\bar{x}_i$ can be regarded as the projection of $x_i$ onto the manifold in the original space.

Due to the variation of the curvature of the underlying manifold, the linear model presented in Section 3.2.1 may not always succeed in discovering local structures. For instance, in a local region with a large curvature, the noise-free estimate $\bar{x}_i^j$ is not reliable. Suppose we have the covariance matrix $C_j$ of $\bar{x}_i^j$ to characterize the uncertainty of


the linear fitting through which $\bar{x}_i^j$ is obtained. Then, the global estimate of the noise-free value $\bar{x}_i$ can be found by minimizing the sum of the following Mahalanobis distances:

$$\bar{x}_i = \arg\min_{\bar{x}_i} \sum_{j=1}^{h} (\bar{x}_i - \bar{x}_i^j)^\top C_j^{-1} (\bar{x}_i - \bar{x}_i^j). \qquad (14)$$

The solution of (14) is

$$\bar{x}_i = \Big( \sum_{j=1}^{h} C_j^{-1} \Big)^{-1} \sum_{j=1}^{h} C_j^{-1} \bar{x}_i^j. \qquad (15)$$

That is, the global estimate $\bar{x}_i$ is the covariance-weighted average of the $\bar{x}_i^j$'s. The more uncertain a local estimate $\bar{x}_i^j$ is (i.e., the inverse of its covariance has a smaller norm), the less it contributes to the global value $\bar{x}_i$.

The covariance $C_j$ can be estimated from the error propagation of the covariance of $w$ during the calculation of $\bar{x}_i^j$. However, since the dimension of $\bar{x}_i^j$ is high, it is not feasible to use $C_j$ directly. Instead, we use the determinant of $C_j$ to approximate (15):

$$\bar{x}_i = \Big( \sum_{j=1}^{h} |C_j|^{-1} \Big)^{-1} \sum_{j=1}^{h} |C_j|^{-1} \bar{x}_i^j, \qquad (16)$$

where $|C_j|$ is approximated by $|C_j| \approx \kappa \hat{\sigma}_j^2$, and $\kappa$ is a constant that does not contribute to the calculation of (16).
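Because the determinant approximation reduces each weight to a scalar, the fusion rule (16) becomes a simple inverse-uncertainty weighted average, as in the sketch below; it assumes $|C_j| \approx \kappa \hat{\sigma}_j^2$ so that the constant $\kappa$ cancels, with $\hat{\sigma}_j^2$ the noise variance estimated in the $j$th local region.

```python
import numpy as np

def fuse_local_estimates(est_list, var_list):
    """Fuse the local estimates of one sample into a global projection, eq. (16).

    est_list : list of h locally smoothed values x_bar_i^j, each of shape (p,).
    var_list : list of h scalar noise variances standing in for |C_j|
               (any common constant factor cancels in the weighted average).
    """
    w = 1.0 / np.asarray(var_list)   # inverse-uncertainty weights
    w = w / w.sum()
    return np.sum(w[:, None] * np.asarray(est_list), axis=0)
```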

    3.2.3 Manifold Reconstruction

Once we have the estimates $\bar{x}_i$, we feed them to current manifold reconstruction algorithms such as LLE and ISOMAP to obtain the low-dimensional embedding vectors $y_i$. In our approach, the EIV modeling and the fusion process actually serve as preprocessing procedures for those manifold learning algorithms in order to achieve a more robust and accurate reconstruction of the nonlinear manifold. One by-product is that we also obtain the projection of each measurement on the manifold in the original space. We do not describe the LLE or ISOMAP algorithm here; the reader may refer to [32], [35] for details.

When the projections $\bar{x}_i$ and the low-dimensional embedding vectors $y_i$ are available, the nonlinear Hotelling $T^2$ and SPE of the $x_i$'s are calculated in the same way as in (4) and (5). We then use a Gaussian mixture model to estimate the density distribution of the two statistics. Section 4 will show that by monitoring these statistics along time, we can detect abnormal points that deviate from the underlying manifold.
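As one possible realization of this density-estimation step (a batch stand-in for illustration; the paper's online monitor uses the SDEM algorithm of Section 4.2), a Gaussian mixture can be fitted to the $(T^2, \mathrm{SPE})$ pairs of the training data with scikit-learn; the number of mixture components here is an arbitrary choice, not a value from the paper.

```python
from sklearn.mixture import GaussianMixture

def fit_statistic_density(Z, n_components=3):
    """Fit a Gaussian mixture to Z, an (n, 2) array of [T^2, SPE] pairs."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    return gmm.fit(Z)

# Usage: unusually low values of gmm.score_samples(z_new) flag samples
# whose statistics fall outside the learned normal-behavior density.
```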

    3.3 Linear versus Nonlinear

The linear model is easy to understand and simple to implement. On the other hand, nonlinear models are more accurate when the nonlinearities of the underlying structure are too pronounced to be approximated by linear models. To choose the correct model for a given set of measurements, we first estimate the intrinsic dimension of the data and then apply a statistical test to decide whether the linear or the nonlinear model should be used.

We use the method proposed in [17] to estimate the intrinsic dimension. It is based on the observation that for an $r$-dimensional data set embedded in the $p$-dimensional space, $x_i \in \mathbb{R}^p$ with $i = 1, \ldots, n$, the number of pairs of points closer to each other than $\epsilon$ is proportional to $\epsilon^r$. We define

$$H_n(\epsilon) = \frac{2}{n(n-1)} \sum_{i<j} I\big( \|x_i - x_j\| < \epsilon \big),$$

where $I(\cdot)$ denotes the indicator function.


of variations in system behavior. To deal with such systems with changing behavior, Section 4.2 provides another solution to detect failures. We employ an online updating algorithm to sequentially update the distribution of the two statistics for every sample in the monitoring process. A failure is then detected based on the statistical deviation of the density distribution before and after the updating.

For the linear model, it is easy to compute the Hotelling $T^2$ and SPE for each new sample $x_t$ based on (4) and (5). However, calculating those statistics online is not straightforward in the nonlinear case, since there is no global formula or parameter set that describes the nonlinear manifold. Therefore, Section 4.1 focuses on the online computation of the statistics for the nonlinear model. Once the newly computed statistics are available, Section 4.2 presents the way of updating their density and detecting anomalies based on the updated density.

    4.1 Online Computing of Two Statistics

Fig. 4 presents the algorithm for the online computation of the Hotelling $T^2$ and SPE for the nonlinear model. Given a new measurement $x_t$, the first step is to find the nearest patch of $x_t$ on the discovered manifold from the training data. We start by locating the nearest point $x^*$ of $x_t$ in the training data together with its nearest neighbors (including $x^*$ itself) $x_1^*, x_2^*, \ldots, x_k^*$, followed by retrieving their projections on the manifold $\bar{x}_1^*, \ldots, \bar{x}_k^*$. The plane spanned by the $\bar{x}_k^*$'s is then regarded as the nearest patch of $x_t$ on the manifold. Note that the nearest neighbors of $x^*$ and their projections have already been calculated during the training phase, so no extra calculations are needed. The only issue is finding the nearest point $x^*$ of $x_t$ in the training data. For high-dimensional $x_t$, the complexity of the nearest neighbor query is practically linear with respect to the training data size. To speed up the computation, a locality-sensitive hashing (LSH) [15] data structure is constructed for the training data to approximate the nearest neighbor search. As a consequence, the nearest neighbor query time has a sublinear dependence on the data size.

Step 2 in Fig. 4 then projects $x_t$ onto its nearest patch. We assume that the patch spanned by the $\bar{x}_i^*$'s is linear and build the matrix $\bar{X} = [\bar{x}_1^*, \ldots, \bar{x}_k^*]$. By solving the equation $\bar{X} w = x_t$ by least squares, we get the estimate $\hat{w} = (\bar{X}^\top \bar{X})^{-1} \bar{X}^\top x_t$ and the projection of $x_t$ on the manifold:

$$\bar{x}_t = \bar{X} \hat{w} = \bar{X} (\bar{X}^\top \bar{X})^{-1} \bar{X}^\top x_t. \qquad (22)$$

Once we obtain the weight $\hat{w}$, the low-dimensional embedding vector $y_t$ of $x_t$ is calculated by (21), based on the observation that the local geometry in the original space should be equally valid for local patches in the manifold. Accordingly, the Hotelling $T^2$ and SPE values of the new sample $x_t$ are calculated.

    4.2 Density-Based Detection

Once the newly calculated Hotelling $T^2$ and SPE are available, we use the sequentially discounting expectation-maximization (SDEM) algorithm [26], as described in Fig. 5, to update the density (20). Based on the original expectation-maximization (EM) algorithm for Gaussian mixture models [31], SDEM utilizes an exponentially weighted moving average (EWMA) filter to adapt to frequent system changes. For instance, given a sequence of observations $\{x_1, x_2, \ldots, x_n, \ldots\}$, an online EWMA filter of the mean is expressed as

$$\mu_{n+1} = (1 - \lambda)\mu_n + \lambda x_{n+1}, \qquad (23)$$

where the forgetting parameter $\lambda$ dictates the degree of discounting of previous examples. Intuitively, the larger $\lambda$ is, the faster the algorithm ages out past examples. Note that there is another parameter $\alpha$ in Fig. 5, set between [1.0, 2.0] and used in the estimation of the mixing probabilities, in order to improve the stability of the solution.
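The EWMA update (23) is a one-liner; the value of the forgetting parameter below is illustrative only.

```python
def ewma_update(mu, x_new, lam=0.01):
    """One step of the EWMA mean filter, eq. (23)."""
    return (1.0 - lam) * mu + lam * x_new
```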

An anomaly is then determined based on the statistical deviation of the density distribution before and after the new statistic vector $z_t$ is obtained. If we denote the two distributions as $p_{t-1}(z)$ and $p_t(z)$, respectively, our metric, called the Hellinger score, is defined by

$$s_H(z_t) = \int \Big( \sqrt{p_t(z)} - \sqrt{p_{t-1}(z)} \Big)^2 dz. \qquad (24)$$

Intuitively, this score measures how much the probability density function $p_t(z)$ has moved from $p_{t-1}(z)$ after learning $z_t$. A higher score indicates that $z_t$ is an outlier with high probability. For the efficient computation of the Hellinger score, see [26].
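For intuition, the Hellinger score (24) can be approximated by brute force on a discretized grid, as sketched below for a one-dimensional density; the paper instead uses the efficient computation of [26], and in the actual monitor $z$ is the two-dimensional statistic vector.

```python
import numpy as np

def hellinger_score(p_old, p_new, dz):
    """Squared Hellinger distance, eq. (24), on a uniform grid.

    p_old, p_new : density values of p_{t-1} and p_t at the grid points.
    dz           : grid spacing.
    """
    return float(np.sum((np.sqrt(p_new) - np.sqrt(p_old)) ** 2) * dz)
```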


Fig. 4. Online computing of the Hotelling $T^2$ and SPE for the nonlinear model.

Fig. 5. The SDEM algorithm for updating the mixing probability $c_i^t$, the mean $\mu_i^t$, and the covariance $\Sigma_i^t$ of the $k$ Gaussian functions in (20), given the new statistic $z_t$.


    5 FAILURE LOCALIZATION

This section discusses how to find a set of the most suspicious attributes, after a failure has been detected, based on their relation to the failure. Although these returned attributes may not reveal the exact root cause of the failure, they still contain useful clues and can thereby greatly help system operators narrow the debugging scope down to a few highly probable components.

We collect and analyze the failure measurement $x_f$ and determine which of its statistics has gone wrong by comparing each statistic with its marginal distribution derived from the joint density (20). If the SPE deviates from its distribution, we use the ranking method in Section 5.1 to return the most suspicious variables. Similarly, if the Hotelling $T^2$ goes wrong, we use the method in Section 5.2 to rank the variables. If both statistics go wrong, the union of the two sets of most suspicious variables is returned.

5.1 Variable Ranking from SPE

Given the failure measurement $x_f$, according to (5), its SPE is the squared norm of the residual vector $\tilde{x}_f = x_f - \bar{x}_f$. Our method for finding the most suspicious attributes from a deviated SPE is based on the absolute value of each element of the residual $\tilde{x}_f$: if the $i$th element of $\tilde{x}_f$ has a large absolute value, then the $i$th attribute is one of the most suspicious.

However, a direct comparison of the elements of $\tilde{x}_f$ is not reliable, since the attributes have different scales. In order to prevent one attribute from outweighing the others, we record the absolute values of each element of the residual vectors of the training samples and calculate their mean $\tilde{m}$ and standard deviation $\tilde{\sigma}$. The $i$th element of $\tilde{x}_f$ is then transformed into a new variable:

$$s_i = \frac{|\tilde{x}_{f,i}| - \tilde{m}_i}{\tilde{\sigma}_i}, \qquad (25)$$

where $\tilde{m}_i$ and $\tilde{\sigma}_i$ represent the $i$th elements of $\tilde{m}$ and $\tilde{\sigma}$, respectively. A large $s_i$ value indicates a high importance of the $i$th attribute to the detected failure.
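Equation (25) transcribes directly into code; the sketch assumes the training residual statistics $\tilde{m}$ and $\tilde{\sigma}$ have been collected beforehand and returns the indices of the highest-ranked attributes (five of them here, matching the experiments in Section 6).

```python
import numpy as np

def rank_by_spe(residual_f, m_tilde, s_tilde, top=5):
    """Rank attributes from a deviated SPE, eq. (25).

    residual_f : (p,) residual vector x_f - x_bar_f of the failure sample.
    m_tilde    : (p,) mean of the absolute training residuals per attribute.
    s_tilde    : (p,) standard deviation of those absolute residuals.
    """
    s = (np.abs(residual_f) - m_tilde) / s_tilde
    return np.argsort(s)[::-1][:top]   # most suspicious attributes first
```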

    5.2 Variable Ranking from Hotelling T2

For the linear model, the Hotelling $T^2$ in (4) can be simplified as $T^2 = y^\top C_y^{-1} y = \sum_{i=1}^{r} y_i^2 / \sigma_i^2$. We define

$$v_t = \Big[ \frac{y_1^2}{\sigma_1^2}, \ldots, \frac{y_r^2}{\sigma_r^2} \Big]^\top, \qquad (26)$$

in which the $i$th element represents the importance of the $i$th principal component $y_i$ to the $T^2$ statistic. However, the principal components usually do not have any physical meaning. To reveal failure evidence at the attribute level, we compute the contribution of the original variable $x_i$ to each principal component in terms of extracted variances and denote it as $v_i$. It is expected that the variable whose $v_i$ has the same element distribution as $v_t$ is the most suspicious. Therefore, we compute

$$t_i = v_i^\top v_t \qquad (27)$$

for the $i$th attribute and reveal the suspicious attributes based on the $t_i$'s. If the value of $t_i$ is large, then the $i$th attribute needs more attention in the debugging.

In order to calculate $v_i$ for each variable $x_i$, we consider the principal component loading matrix $L \in \mathbb{R}^{p \times r}$ [3], in which each element $l_{ij}$ represents the correlation between the $i$th variable $x_i$ and the $j$th principal component $y_j$:

$$l_{ij} = E(x_i y_j). \qquad (28)$$

The matrix $L$ can be computed from the SVD of the data matrix $X$ [3], and the square of its element, $l_{ij}^2$, gives the proportion of variance in the original variable $x_i$ explained by the principal component $y_j$. Therefore, we define a matrix $M$ with elements $M_{ij} = l_{ij}^2 \, \mathrm{var}(x_i)$ to represent the actual variance in $x_i$ that is explained by $y_j$. The sum of the $j$th column of $M$, $f_j = \sum_i M_{ij}$, represents the total variance extracted by the $j$th principal component $y_j$. We divide each element $M_{ij}$ by $f_j$ to obtain a new matrix $\tilde{M}$ with elements $\tilde{M}_{ij} = M_{ij} / f_j$. The $i$th row of the matrix $\tilde{M}$ then represents the contribution of the original variable $x_i$ to each principal component in terms of extracted variances:

$$v_i = [\tilde{M}_{i1} \cdots \tilde{M}_{ir}]^\top. \qquad (29)$$

For the nonlinear model, however, the low-dimensional embedding vector $y$ cannot be expressed as a linear combination of the original attributes. In order to apply the above variable ranking method to nonlinear situations, we perform the SVD on the nearest patch of $x_f$, $\bar{X} = [\bar{x}_1^*, \ldots, \bar{x}_k^*]$, as discovered in Section 4.1. By doing so, we obtain $x_f$'s local principal components $y_f$. The loading matrix $L$ and, hence, the $v_i$'s are then calculated from the data matrix $\bar{X}$ and $y_f$, and $v_t$ is computed from $y_f$. Accordingly, the $t_i$ values in (27) can be calculated to rank the variables.
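Since the rows of $\tilde{M}$ are exactly the vectors $v_i$, the ranking of (26)-(29) can be written compactly in matrix form. The sketch below assumes the loading matrix $L$ and the attribute variances are available from the training data.

```python
import numpy as np

def rank_by_t2(y_f, sigma2, L, var_x, top=5):
    """Rank attributes from a deviated Hotelling T^2, eqs. (26)-(29).

    y_f    : (r,) principal components of the failure sample.
    sigma2 : (r,) variances of the principal components.
    L      : (p, r) loading matrix, L[i, j] = corr(x_i, y_j), eq. (28).
    var_x  : (p,) variances of the original attributes.
    """
    v_t = y_f ** 2 / sigma2                      # eq. (26)
    M = (L ** 2) * var_x[:, None]                # variance of x_i explained by y_j
    M_tilde = M / M.sum(axis=0, keepdims=True)   # divide by column sums f_j
    t = M_tilde @ v_t                            # t_i = v_i^T v_t, eq. (27)
    return np.argsort(t)[::-1][:top]
```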

    6 EXPERIMENTAL RESULTS

In this section, we first use synthetic data to demonstrate the advantages of the local EIV modeling and the fusion process, described in Sections 3.2.1 and 3.2.2, in achieving a more accurate reconstruction of the nonlinear manifold. We also evaluate both the linear and nonlinear methods in detecting a set of generated outliers. Then, we apply our proposed high-dimensional data monitoring approach to a real J2EE-based application to detect a variety of injected system failures.

6.1 Synthetic Data

In addition to providing an original framework for monitoring high-dimensional measurements in information systems, this paper also makes novel algorithmic contributions to nonlinear manifold recovery. We have proposed the local EIV model and a fusion process to reduce the noise in the original measurements and hence achieve a more accurate reconstruction of the underlying manifold. In Section 6.1.1, we use synthetic data to demonstrate the effectiveness of the proposed algorithms. In addition, we generate some outliers to evaluate the detection performance of both the linear and nonlinear models in Section 6.1.2.

    6.1.1 Manifold Reconstruction

For ease of visualization, a 1D manifold (a curve) is used in this example. The 400 data points are generated by $g(t) = [t \cos t, \; t \sin t]^\top$, corrupted by a certain amount of Gaussian


noise, where $t$ is uniformly sampled in the interval $[0, 4\pi]$. Fig. 6a shows such a curve under noise with standard deviation $\sigma = 0.3$. Since in this example the data dimension $p = 2$ is smaller than the neighborhood size $k$ (= 12 for this data set), we use a regularized solution of the EIV model [13] to calculate the matrix $\bar{E}$ in (13).

The reconstruction accuracy for the 1D manifold is measured by the relationship between the recovered manifold coordinate $\tilde{\tau}$ and the centered arc length $\ell(t)$, defined from

$$\ell(t) = \int_{t_0}^{t} \| J_g(t) \| \, dt, \qquad (30)$$

where $J_g(t)$ is the Jacobian of $g(t)$:

$$J_g(t) = [\cos t - t \sin t, \; \sin t + t \cos t]^\top. \qquad (31)$$
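For this curve, (31) gives $\|J_g(t)\| = \sqrt{1 + t^2}$, so the integral in (30) has a closed form. The sketch below regenerates the synthetic data and evaluates the arc length; the sampling interval $[0, 4\pi]$ follows the reconstruction above, and centering is done here by subtracting the mean, one plausible reading of the paper's "centered" arc length.

```python
import numpy as np

def spiral_data(n=400, noise_std=0.3, t_max=4 * np.pi, seed=0):
    """Noisy samples from the 1D manifold g(t) = (t cos t, t sin t)."""
    rng = np.random.default_rng(seed)
    t = np.sort(rng.uniform(0.0, t_max, n))
    g = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)
    return g + noise_std * rng.standard_normal((n, 2)), t

def centered_arc_length(t):
    """Arc length of g, using ||J_g(t)|| = sqrt(1 + t^2) from (31).

    Closed form of the integral in (30): (t*sqrt(1+t^2) + asinh(t)) / 2.
    """
    ell = 0.5 * (t * np.sqrt(1.0 + t ** 2) + np.arcsinh(t))
    return ell - ell.mean()   # center the arc length
```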

The more accurate the manifold reconstruction is, the more linear the relationship between $\tilde{\tau}$ and $\ell$ becomes. Figs. 6b, 6c, and 6d show the relationship curves generated by the LLE algorithm, LLE with EIV modeling, and LLE with both EIV modeling and fusion. It is obvious that LLE with both EIV modeling and fusion outperforms the other two algorithms. Note that in the LLE algorithm with only EIV modeling, the estimate of the projection $\bar{x}$ is taken as the locally smoothed value from the region that surrounds $x$.

A further comparison of the performance of the three algorithms is carried out by random simulations on manifolds in a 100-dimensional space. We generate the same 2D data $g(t_i)$ as in the first example and then transform the data into 100-dimensional vectors by the orthogonal transformation $x_i = Q g(t_i)$, where $Q \in \mathbb{R}^{100 \times 2}$ is a random orthonormal matrix. We add different levels of noise to the data, with the standard deviation ranging from 0.1 to 1. At each noise level, 100 trials are run. We use the correlation coefficient between the recovered manifold coordinate $\tilde{\tau}$ and the centered arc length $\ell$,

$$\rho = \frac{\mathrm{cov}(\tilde{\tau}, \ell)}{\sqrt{\mathrm{var}(\tilde{\tau}) \, \mathrm{var}(\ell)}}, \qquad (32)$$

to measure the strength of their linear relationship. Fig. 7 shows the mean and standard deviation of the correlation coefficient $\rho$ obtained by LLE, LLE with EIV modeling, and LLE with EIV modeling and fusion over the 100 trials at each noise level. Note that the vertical bars in the figure mark one standard deviation from the mean. The results illustrate that both the EIV modeling and the fusion are beneficial in reducing the noise of the data samples.

    6.1.2 Outlier Detection

We use data from a highly nonlinear structure to illustrate the effectiveness of the manifold reconstruction and the two statistics in detecting outliers. We generate 1,000 3D data points $z_i$ from a 2D Swiss roll, which is shown in Fig. 8a. The 3D data are then transformed into 100-dimensional vectors $x_i$ by an orthogonal transformation $x_i = Q z_i$, where $Q \in \mathbb{R}^{100 \times 3}$


Fig. 6. (a) Samples from a noisy 1D manifold. (b) The centered arc length $\ell$ versus the manifold coordinate $\tilde{\tau}$ recovered by LLE. (c) The same relationship recovered by LLE with the EIV modeling. (d) The same relationship recovered by LLE with the EIV modeling and fusion.

Fig. 7. Performance comparison of LLE, LLE with EIV, and LLE with EIV and fusion in noise reduction. The vertical bars mark one standard deviation from the mean.

Fig. 8. The comparison of the linear and nonlinear models in outlier detection. (a) The 3D view of the training data. (b) The 3D view of the test samples (marked as 'o') and the training data (marked as '.'). (c) The scatterplot of the SPE and $T^2$ statistics of the test samples computed by the linear method; the normal test samples are marked as '.' and the outliers as 'x'. (d) The same scatterplot computed by the nonlinear method.


is a random orthonormal matrix. Gaussian noise with standard deviation 0.5 is also added to the 100D vectors. Meanwhile, we generate 200 test samples, half of which are normal data generated in the same way as the training data, and the other half randomly generated outliers. Fig. 8b presents a 3D view of the test data (marked as 'o') and the training data.

We apply both linear PCA and the proposed manifold reconstruction algorithm to reconstruct the underlying structure from the training data. Based on the learned structure, the $T^2$ and SPE statistics are computed for the training data and the test samples. Figs. 8c and 8d present the scatterplots of the two statistics of the test data computed by the linear and nonlinear methods, respectively, in which the normal data are plotted as '.' and the outliers are marked as 'x'. In Fig. 8c, we see that the normal samples and outliers highly overlap. The linear method shows poor performance in detecting the outliers because of the high nonlinearity of the data structure. On the other hand, the results from the nonlinear method present a clear separation between normal samples and outliers, as shown in Fig. 8d. In this figure, we notice that the SPE statistic plays an important role in distinguishing outliers from the normal data. This is reasonable, because the outliers are randomly generated, and it is less likely for them to be located on the same manifold structure as the inliers.

However, the $T^2$ statistic, which measures the in-model distances among samples, is also useful in outlier identification. As shown in Fig. 8d, two outliers, (8.7, 86.9) and (8.4, 108.1), exhibit small SPE values but large $T^2$ scores. We checked these two points in the original 3D data and found that they are actually located on the manifold of inliers but far from the cluster of inlier data. In this case, the $T^2$ statistic provides good evidence about the existence of outliers.

    6.2 Real Test Bed

Our algorithms have been tested on a real e-commerce application based on the J2EE multitiered architecture. J2EE is a widely adopted platform standard for constructing enterprise applications from deployable Java components called Enterprise JavaBeans (EJBs). The architecture of our testbed system is shown in Fig. 9. We use Apache as the Web server. The application server consists of the Web container (Tomcat) and the EJB container (JBoss). MySQL runs at the back end to provide persistent storage of data. PetStore 1.3.2 is deployed as our testbed application. Its functionality includes a storefront, shopping cart, purchase tracking, and so on. There are 47 components in PetStore, including EJBs, Servlets, and JSPs. We built a client emulator to generate a workload similar to that created by typical user behavior. The emulator produces a varying number of concurrent client connections, with each client simulating a session based on some common scenarios, which consist of a series of requests such as creating new accounts, searching by keywords, browsing item details, updating user profiles, placing orders, and checking out.

The monitored data are collected from the three servers (Web server, application server, and database server) in our testbed system. Each server generates measurements from a variety of sources such as the CPU, disk, network, and operating system. Fig. 10 lists all these attributes, which are divided into eight categories; the three right columns in the figure give the number of attributes in each category generated by the three servers, respectively. In total, each measurement contains 111 attributes. We manually checked these attributes and observed that many of them are correlated. Fig. 11 presents an example of four highly correlated attributes. It suggests that our proposed approach is applicable to this type of data.

We collect the measurements every 5 seconds under normal system operation, with the magnitude of the workload randomly generated between 0 and 100 user requests per second. In total, 5,000 data samples are gathered as the training data. To determine whether the linear or nonlinear model best characterizes the data set, we calculate $H_n(\epsilon_i)$ for different $\epsilon_i$'s, as described in Section 3.3, fit a line between their log values, $\log H_n(\epsilon_i) = 5.83 \log \epsilon_i - 15.14$, and


    Fig. 9. The architecture of the testbed system.

    Fig. 10. The list of attributes from the testbed.

    Fig. 11. Example of four correlated attributes.


get the intrinsic dimension $r = \lceil 5.83 \rceil = 6$. Since the calculated value 0.92 from (19) is smaller than the threshold 0.98, it is suggested that the nonlinear model be applied to the data. In the following, we confirm this conclusion by comparing the performance of the linear and nonlinear models in detecting a variety of injected failures in the system.

We modified the code in some EJB components of the PetStore application to simulate a number of real system failures. Five types of faults are injected into various components, with different intensities, to demonstrate the robustness of our approach.

Memory Leaking. We simulate three memory leaking failures by repeatedly allocating three different sizes (1 Kbyte, 10 Kbytes, and 100 Kbytes) of heap memory in the ShoppingCartEJB of the PetStore application. Since a reference to that EJB object is always held by other objects, the Java garbage collector does not reclaim the leaked memory. Hence, the PetStore application gradually exhausts the supply of virtual memory pages, which leads to severe performance issues and makes the completion of client requests much slower.

File Missing. In the packaging process of Java Web applications, it can happen that a file is improperly dropped from the required composition, which results in failures to invoke a correct system response and may eventually cause service malfunction, making the user come across strange Web pages. Here, we simulate five such failures by dropping different JSP files from the PetStore application to mimic operators' mistakes during system maintenance.

Busy Loop. The actual causes of request slowdown can be various, such as a spinlock fault among synchronized threads. We simulate the phenomenon of slowdown by adding a busy-loop procedure to the code. Depending on the number of loops in the instrumentation, the significance of the simulated fault differs. In this section, we simulate five different busy-loop failures by allocating 30, 65, 100, 150, and 300 loops in the ShoppingCartLocalEJB of the PetStore application, respectively.

Expected Exception. An expected exception [9] happens when a method declaring exceptions (which appear in the method's signature) is invoked. In this situation, an exception is thrown without the method's code being executed. As a consequence, the user may encounter strange Web pages. We inject this fault into two different EJBs of PetStore, CatalogEJB and AddressEJB, to generate two expected-exception failures.

Null Call. The null-call fault [9] causes all methods in the affected component to return a null value without executing the method's code. It is usually caused by errors in allocating system resources, failed lookups, and so on. Similar to the expected exception, the null-call failure results in strange Web pages. We inject this fault into two different EJBs of PetStore, CatalogEJB and AddressEJB, to generate two failure cases.

Altogether, 17 failure cases are simulated from the five different types. Note that the system is restarted before every failure injection in order to remove the impact of previously injected failures. In addition, the workloads are dynamically generated with much randomness, so we never get a similar workload twice in the experiments. We randomly collect a certain number of measurements from each failure case and in total obtain 425 abnormal measurements. We also collect 575 normal samples, so that the test data set contains 1,000 samples.

    The linear and nonlinear models are compared inrepresenting the training data and detecting failure samplesfrom the test data. Figs. 12a and 12b present the scatterplots

    of Hotelling T2 and SP E of the test data produced by thelinear and nonlinear models, respectively. The values ofnormal test data are marked as . in the figures, and thoseof failure data are marked as . For the linear modelshown in Fig. 12a, there is an overlap between the normaldata distribution and that of the failure samples. Inaddition, there are four normal points with very large SPEvalues (at around 120). We check those points in Fig. 12bprovided by the nonlinear method and find that amongthose four points, three are located in the cluster of normalsamples, and only one point, (39.5, 2.1), is hard to separatefrom outliers. Compared with the linear model, the non-linear model produces more clear separation betweennormal and abnormal samples in the generated statistics.

    We also notice that for the nonlinear model, the SPE statisticplays a dominant role in detecting the outliers. Based on thesimilar observation that we obtained in Fig. 8d from thesynthetic data, we can explain this by two factors: 1) thenonlinear method correctly identifies the underlying datastructure and 2) in the experiment, most failure points arelocated outside the discovered manifold. In spite of theimportance of SPE, the T2 statistic is also useful in failuredetection, especially when the SPE values are in theambiguity region, for example, between 27 and 37 forthe nonlinear model. For the linear method, such ambi-guity region of SPE is wider, which is from 15 to 35, asshown in Fig. 12a, because of its linear assumption ofunderlying data structure. In order to qualitatively comparethe performances of two detectors, we use the methoddescribed in Section 4.2 to build the joint density of T2 andSP E based on values computed from the training data andcalculate the Hellinger score (24) for every test sample.Based on these scores, the ROC curves of the two modelsare plotted in Fig. 13. It shows that both the linear andnonlinear methods obtain acceptable results in detecting thefailure samples due to the moderate nonlinearity of datagenerated in the experiment. However, the nonlinear modelproduces more accurate results than the linear model.

A further investigation reveals additional advantages of the nonlinear model over the linear method. We find that the values of T2 and SPE calculated by the nonlinear model can also provide useful clues about the significance of injected failures.


Fig. 12. The scatterplot of SPE and T2 of the test data produced by (a) the linear model and (b) the nonlinear model. Normal test data and failure data are shown with distinct markers.



Fig. 14 uses the SPE values of the busy-loop failures to demonstrate this fact. Fig. 14b shows the histogram of the SPE values of normal test data generated by the nonlinear model. Figs. 14d, 14f, and 14h present the SPE of test samples from three busy-loop failure cases with different impacts, in which 30, 65, and 100 busy loops are injected into an EJB component, respectively. The results show that the SPE values for these failures are well separated and that a failure with stronger significance produces larger SPE values. The SPE values computed by the linear model are also shown in Figs. 14a, 14c, 14e, and 14g. Compared with those from the nonlinear model, the SPE values from the linear model overlap and lack strong evidence about the significance of the injected failures.

Our failure localization procedure also produces satisfactory results. Here, we use the variable ranking derived from SPE to demonstrate this. We randomly select 200 samples from the failure measurements whose SPE values are affected. Among the 200 selected data, each type of failure occupies 40 samples, which are continuously indexed. We apply the attribute ranking method described in Section 5 and output a vector v_i that contains the indices of the five most suspicious attributes. In total, we generate 200 such vectors v_i, i = 1, ..., 200. In order to see whether these attribute ranking results really tell any evidence about the injected failure, we perform hierarchical clustering on the v_i's, in which the Jaccard coefficient

    J(v_i, v_j) = |v_i ∩ v_j| / |v_i ∪ v_j|

is used to calculate the similarity between v_i and v_j. We start by assigning each vector v_i to its own cluster and then merge the clusters with the largest average similarity until five clusters are obtained (see the sketch after this paragraph). The results show that the vectors v_i from the same type of failure provide consistent variable ranking results. Table 1 lists the cluster indices associated with each failure measurement. We see that all the vectors belonging to the memory-leaking (with indices 1-40) and file-missing (41-80) failures form separate clusters. Most of the vectors from the busy-loop failure (81-120) form one cluster, except for five noisy points. However, it is hard to separate the null-call (121-160) and expected-exception (161-200) failures. This is reasonable, since these two types of failures are generated by similar mechanisms. Therefore, our proposed failure localization method provides consistent and useful evidence about the failure. It is our future work to look further into the clustered vectors and reveal a signature for each type of failure based on its suspicious attributes. By doing so, we can quickly identify and solve recurrent system failures by retrieving similar signatures from historic failures.
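The clustering step can be sketched as follows, assuming each ranking is given as a set of five attribute indices; jaccard and cluster_rankings are illustrative names for this simplified average-linkage procedure, which is fast enough for the 200 vectors used here:

    import numpy as np

    def jaccard(a, b):
        # Jaccard coefficient |a ∩ b| / |a ∪ b| between two sets of
        # suspicious-attribute indices.
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    def cluster_rankings(vectors, n_clusters=5):
        # Average-linkage agglomerative clustering under Jaccard
        # similarity: every vector starts in its own cluster, and the
        # pair of clusters with the largest average pairwise similarity
        # is merged until n_clusters remain.
        sim = np.array([[jaccard(u, v) for v in vectors] for u in vectors])
        clusters = [[i] for i in range(len(vectors))]
        while len(clusters) > n_clusters:
            best, pair = -1.0, (0, 1)
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    s = np.mean([sim[a][b] for a in clusters[i]
                                 for b in clusters[j]])
                    if s > best:
                        best, pair = s, (i, j)
            i, j = pair
            clusters[i].extend(clusters.pop(j))
        return clusters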

7 CONCLUSIONS

This paper has presented a method for monitoring high-dimensional data in information systems based on the observation that the high-dimensional measurements are usually located in a low-dimensional structure embedded in the original space. We have developed both linear and nonlinear algorithms to discover the underlying low-dimensional structure of data. Two statistics, the Hotelling T2 and SPE, have been used to represent the data variations within and outside the revealed structure. Based on the probabilistic density of these statistics, we have successfully detected a variety of simulated failures in a J2EE-based Web application. In addition, we have discovered a list of suspicious attributes for each detected failure, which are helpful in finding the failure root cause.


Fig. 14. (a) and (b) The histograms of SPE of normal test samples produced by the linear and nonlinear models. (c) and (d) The histograms of SPE by the linear and nonlinear models for samples from the busy-loop failures, with 30 loops injected. (e) and (f) The histograms of SPE by the linear and nonlinear models for samples from the busy-loop failures, with 65 loops injected. (g) and (h) The histograms of SPE by the linear and nonlinear models for samples from the busy-loop failures, with 100 loops injected.

TABLE 1
The Clustering Results of Failure Measurements Based on the Outcomes of Attribute Ranking

Fig. 13. The ROC curves for failure detectors based on the linear and nonlinear models.



REFERENCES

[1] C.C. Aggarwal and P.S. Yu, "Outlier Detection for High-Dimensional Data," Proc. ACM SIGMOD '01, pp. 37-46, 2001.
[2] M.K. Aguilera, W. Chen, and S. Toueg, "Using the Heartbeat Failure Detector for Quiescent Reliable Communication and Consensus in Partitionable Networks," Theoretical Computer Science, special issue on distributed algorithms, vol. 220, pp. 3-30, 1999.
[3] T.W. Anderson, An Introduction to Multivariate Statistical Analysis, second ed. Wiley, 1984.
[4] M. Balasubramanian and E.L. Schwartz, "The Isomap Algorithm and Topological Stability," Science, vol. 295, no. 7, 2002.
[5] P. Barham, R. Isaacs, R. Mortier, and D. Narayanan, "Magpie: Real-Time Modeling and Performance-Aware Systems," Proc. Ninth Workshop Hot Topics in Operating Systems (HotOS '03), May 2003.
[6] P. Bodik et al., "Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization," Proc. Second Int'l Conf. Autonomic Computing (ICAC '05), pp. 89-100, June 2005.
[7] M. Brand, "Charting a Manifold," Advances in Neural Information Processing Systems 15, MIT Press, 2003.
[8] T. Brotherton and T. Johnson, "Anomaly Detection for Advanced Military Aircraft Using Neural Networks," Proc. IEEE Aerospace Conf., pp. 3113-3123, 2001.
[9] M. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer, "Pinpoint: Problem Determination in Large Dynamic Systems," Proc. Int'l Performance and Dependability Symp. (IPDS '02), June 2002.
[10] HP OpenView, HP Corp., http://www.openview.hp.com/, 2007.
[11] M.J. Desforges, P.J. Jacob, and J.E. Cooper, "Applications of Probability Density Estimation to the Detection of Abnormal Conditions in Engineering," Proc. Inst. of Mechanical Eng., Part C: J. Mechanical Eng. Science, vol. 212, pp. 687-703, 1998.
[12] C. Eckart and G. Young, "The Approximation of One Matrix by Another of Lower Rank," Psychometrika, vol. 1, pp. 211-218, 1936.
[13] R.D. Fierro, G.H. Golub, P.C. Hansen, and D.P. O'Leary, "Regularization by Truncated Total Least Squares," SIAM J. Scientific Computing, vol. 18, pp. 1223-1241, 1997.
[14] W. Fuller, Measurement Error Models. John Wiley & Sons, 1987.
[15] A. Gionis, P. Indyk, and R. Motwani, "Similarity Search in High Dimensions via Hashing," Proc. 25th Int'l Conf. Very Large Data Bases (VLDB '99), pp. 518-529, 1999.
[16] G.H. Golub and C.F. Van Loan, Matrix Computations, third ed. Johns Hopkins Univ. Press, 1996.
[17] P. Grassberger and I. Procaccia, "Measuring the Strangeness of Strange Attractors," Physica D, vol. 9, pp. 189-208, 1983.
[18] A. Hoskuldsson, "PLS Regression Methods," J. Chemometrics, vol. 2, no. 3, pp. 211-228, 1988.
[19] Tivoli Business System Manager, IBM, http://www.tivoli.com/, 2007.
[20] T. Ide and H. Kashima, "Eigenspace-Based Anomaly Detection in Computer Systems," Proc. ACM SIGKDD '04, pp. 440-449, Aug. 2004.
[21] G. Jiang, H. Chen, C. Ungureanu, and K. Yoshihira, "Multi-Resolution Abnormal Trace Detection Using Varied-Length n-Grams and Automata," Proc. Second Int'l Conf. Autonomic Computing (ICAC '05), pp. 111-122, June 2005.
[22] G. Jiang, H. Chen, and K. Yoshihira, "Discovering Likely Invariants of Distributed Transaction Systems for Autonomic System Management," Proc. Third Int'l Conf. Autonomic Computing (ICAC '06), pp. 199-208, June 2006.
[23] I.T. Jolliffe, Principal Component Analysis. Springer-Verlag, 1986.
[24] T. Kourti and J.F. MacGregor, "Recent Developments in Multivariate SPC Methods for Monitoring and Diagnosing Process and Product Performance," J. Quality Technology, vol. 28, no. 4, pp. 409-428, 1996.
[25] R. Kozma, M. Kitamura, M. Sakuma, and Y. Yokoyama, "Anomaly Detection by Neural Network Models and Statistical Time Series Analysis," Proc. IEEE World Congress on Computational Intelligence '94, pp. 3207-3210, 1994.
[26] K. Yamanishi, J. Takeuchi, G. Williams, and P. Milne, "On-Line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms," Proc. Sixth ACM SIGKDD '00, pp. 320-344, 2000.
[27] M. Markou and S. Singh, "Novelty Detection: A Review, Part 1: Statistical Approaches," Signal Processing, vol. 83, pp. 2481-2497, 2003.
[28] M. Markou and S. Singh, "Novelty Detection: A Review, Part 2: Neural Network Based Approaches," Signal Processing, vol. 83, pp. 2499-2521, 2003.
[29] L. Mirsky, "Symmetric Gauge Functions and Unitarily Invariant Norms," Quarterly J. Math. Oxford, vol. 11, pp. 50-59, 1960.
[30] M.J. Piovoso, K.A. Kosanovich, and J.P. Yuk, "Process Data Chemometrics," IEEE Trans. Instrumentation and Measurement, vol. 41, no. 2, pp. 262-268, 1992.
[31] R.A. Redner and H.F. Walker, "Mixture Densities, Maximum Likelihood and the EM Algorithm," SIAM Rev., vol. 26, pp. 195-239, 1984.
[32] S. Roweis and L. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, vol. 290, pp. 2323-2326, 2000.
[33] N.K. Shah and P.J. Gemperline, "Combination of the Mahalanobis Distance and Residual Variance Pattern Recognition Techniques for Classification of Near-Infrared Reflectance Spectra," Analytical Chemistry, vol. 62, no. 5, pp. 465-470, 1990.
[34] D.M.J. Tax and R.P.W. Duin, "Support Vector Domain Description," Pattern Recognition Letters, vol. 20, pp. 1191-1199, 1999.
[35] J.B. Tenenbaum, V. de Silva, and J.C. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, pp. 2319-2323, 2000.
[36] S. Van Huffel and J. Vandewalle, The Total Least Squares Problem: Computational Aspects and Analysis. Soc. for Industrial and Applied Math., 1991.

Haifeng Chen received the BEng and MEng degrees in automation from Southeast University, China, in 1994 and 1997, respectively, and the PhD degree in computer engineering from Rutgers University, New Jersey, in 2004. He was a researcher at the Chinese National Research Institute of Power Automation. He is currently a research staff member at NEC Laboratories America, Princeton, New Jersey. His research interests include data mining, autonomic computing, pattern recognition, and robust statistics.

Guofei Jiang received the BS and PhD degrees in electrical and computer engineering from the Beijing Institute of Technology, Beijing, in 1993 and 1998, respectively. From 1998 to 2000, he was a postdoctoral fellow in computer engineering at Dartmouth College, New Hampshire. He is currently a senior research staff member with the Robust and Secure Systems Group, NEC Laboratories America, Princeton, New Jersey. His current research interests include distributed systems, dependable and secure computing, and system and information theory. He has published nearly 50 technical papers in these areas. He is an associate editor for IEEE Security and Privacy and has served in the program committees of many prestigious conferences.

Kenji Yoshihira received the BE degree in electrical engineering from the University of Tokyo in 1996 and the MS degree in computer science from New York University in 2004. For five years, he was with Hitachi, where he designed processor chips for enterprise computers. Until 2002, he was a chief technical officer (CTO) at Investoria Inc., Japan, where he developed an Internet service system for financial information distribution. He is currently a research staff member with the Robust and Secure Systems Group, NEC Laboratories America, Inc., New Jersey. His current research interests include distributed systems and autonomic computing.

