Monitoring High-Dimensional Data for Failure Detection and Localization in Large-Scale Computing Systems

Haifeng Chen, Guofei Jiang, and Kenji Yoshihira
Abstract: It is a major challenge to process high-dimensional measurements for failure detection and localization in large-scale computing systems. However, it is observed that in information systems, those measurements are usually located in a low-dimensional structure that is embedded in the high-dimensional space. From this perspective, a novel approach is proposed to model the geometry of underlying data generation and detect anomalies based on that model. We consider both linear and nonlinear data generation models. Two statistics, that is, the Hotelling T² and the squared prediction error (SPE), are used to reflect data variations within and outside the model. We track the probabilistic density of the extracted statistics to monitor the system's health. After a failure has been detected, a localization process is also proposed to find the most suspicious attributes related to the failure. Experimental results on both synthetic data and a real e-commerce application demonstrate the effectiveness of our approach in detecting and localizing failures in computing systems.

Index Terms: Failure detection, manifold learning, statistics, data mining, information system, Internet applications.
1 INTRODUCTION
DETECTING failures promptly in large-scale Internet services has become increasingly critical. A single hour of downtime of services such as Google, MSN, and Yahoo! could often result in millions of dollars of lost revenue, bad publicity, and click-over to competitors. One significant problem in building such detection tools is the high dimensionality of the measurements collected from large-scale computing infrastructures. For example, commercial frameworks such as HP's OpenView [10] and IBM's Tivoli [19] aggregate attributes from a variety of sources, including hardware, networking, operating systems, and application servers. It is hard to extract meaningful information from those data to distinguish anomalous situations from normal ones.
In many circumstances, however, the system measurements are not truly high dimensional. Rather, they can efficiently be summarized in a space with much lower dimension, because many of the attributes are correlated. For instance, a multitier e-commerce system may serve a large number of user requests every day, and many internal attributes accordingly react to the volume of user requests as the requests flow through the system [22]. Such internal correlations among system attributes motivate us to develop a new approach for monitoring high-dimensional data for system failure detection. We discover the underlying low-dimensional structure of the monitoring data and extract two statistics, that is, the Hotelling T² score and the squared prediction error (SPE), from each measurement to express its variations within and outside the discovered model. Failure detection is then carried out by tracking the probabilistic density of these statistics along time. Each time a new measurement comes in, we calculate its related statistics and then update their probabilistic density based on the newly computed values. A large deviation of the density distribution before and after the update is regarded as an indication of system failure.

We start with the situation where the monitoring data is
generated from a low-dimensional linear structure. Singular value decomposition (SVD) is employed to discover the linear subspace that contains the majority of the data. The Hotelling T² and SPE are then derived from the geometric features of each measurement with respect to that subspace. After that, we extend our work to the case where the underlying data structure is nonlinear, which is often encountered in information systems due to nonlinear mechanisms such as caching, queuing, and resource pooling in the system. Unlike the linear model, however, there are many new challenges in the extraction of geometric features from nonlinear data. For example, there are no parametric equations for globally describing the nonlinear model. Our approach is based on the assumption that the data lies on a nonlinear (Riemannian) manifold. That is, even though the measurements are globally nonlinear, they are often smooth and approximately linear in a local region. In the last few years, many manifold reconstruction algorithms have been proposed: locally linear embedding (LLE) [32], isometric feature mapping (ISOMAP) [35], and so on. However, these algorithms all focus on the problem of dimension reduction, which determines the low-dimensional embedding vector y_i ∈ R^r of the original measurement x_i ∈ R^p. In this paper, in order to derive the nonlinear version of Hotelling T² and SPE, we need more geometric information about the data
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 1, JANUARY 2008 13
The authors are with NEC Laboratories America, Inc., 4 Independence Way, Suite 200, Princeton, NJ 08540. E-mail: {haifeng, gfj, kenji}@nec-labs.com.
Manuscript received 27 Nov. 2006; revised 4 July 2007; accepted 5 Sept. 2007; published online 12 Sept. 2007. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-0537-1106.
Digital Object Identifier no. 10.1109/TKDE.2007.190674. 1041-4347/08/$25.00 © 2008 IEEE. Published by the IEEE Computer Society.
distribution aside from their low-dimensional representations y_i. For instance, the projection x̄_i of each measurement x_i on the underlying manifold in the original space is required to compute the SPE value of x_i. Furthermore, it has been noted that current manifold reconstruction algorithms are sensitive to noise and outliers [4], [7]. To solve these issues, we propose a novel approach to discovering the underlying geometry of the nonlinear manifold for the purpose of failure detection. We first use the linear error-in-variables (EIV) model [14] in each local region to estimate the locally smoothed value of each point in that region. Since the local regions overlap, each measurement may have several locally smoothed values. We then propose a fusion process to combine those locally smoothed values and obtain a globally smoothed value for each measurement. Those globally smoothed values are regarded as the projections of the original measurements on the underlying manifold and are then fed to current manifold learning algorithms such as LLE to obtain their low-dimensional representations. Note that instead of directly using the original data for manifold reconstruction, we propose the EIV model and a fusion process to preprocess the data and hence achieve a more robust and accurate reconstruction of the underlying manifold. As a by-product, we also obtain the projections of the original measurements on the underlying manifold in the original space, which are necessary for computing the SPE value of each measurement.
We also present a statistical test algorithm to decide whether the linear or nonlinear model is suitable, given the measurement data. After modeling, we compute the values of the two statistics for each measurement and estimate their probabilistic density. Failure detection is based on the deviation of the newly calculated statistics of each incoming measurement with respect to the learned density. Once a failure is detected, a localization procedure is proposed to reveal the most suspicious attributes based on the values of the violated statistics. We use both synthetic data and measurements from a real e-commerce application to test the effectiveness of our proposed approach. The purpose of using synthetic data is to demonstrate the advantages of the EIV model and fusion in reconstructing the manifold. We compare the results of the LLE algorithm on the original measurements and on the data that has been preprocessed by the EIV model and fusion. The results show that our two proposed procedures are both necessary to achieve an accurate reconstruction of the nonlinear manifold. Then, we test our failure detection and localization methods in a J2EE-based Web application. We collect measurements during normal system operations and apply both the linear and nonlinear algorithms to learn the underlying data structure. We then inject a variety of failures into the system and compare the performance of failure detectors based on the linear and nonlinear models. The results show that both models can detect many failure incidents; however, the nonlinear model produces more accurate results than the linear model.
2 BACKGROUND AND RELATED WORK
The purpose of this paper is to model the normal behavior
of a system and highlight any significant divergence from
normality to indicate the onset of unknown failures. In data mining, this is called anomaly detection or novelty detection, and there has been a large body of literature on this topic. Markou and Singh provided a detailed review of those techniques [27], [28], covering both statistical and neural network-based approaches. The statistical approaches [26], [11] model the data based on their statistical properties and use such information to test whether new samples come from the same distribution, whereas the neural network-based methods [25], [8] train a network to implicitly reveal the unknown data distribution for detecting novelties.
However, much of the early work makes implicit assumptions about the low dimensionality of the data and does not work well for high-dimensional measurements. For example, it is quite difficult for statistical methods to model the density of data with hundreds or thousands of attributes. The computational complexity of neural networks is also an important consideration for high-dimensional data. To address these issues, Aggarwal and Yu [1] proposed a projection-based method to find the best subsets of attributes that can reveal data anomalies. Bodik et al. [6] used a Naive Bayes approach, assuming independently distributed attributes, to model the probabilistic density of high-dimensional data. Tax and Duin [34] proposed a support vector-based approach to identifying a minimal hypersphere that surrounds the normal data. Samples located outside the hypersphere are considered faulty measurements. In this paper, our solution is based on the observation that in information systems, the high-dimensional data are usually located on a low-dimensional underlying structure that is embedded in the high-dimensional space. Unlike the Naive Bayes approach, we discover the correlations among data attributes and extract the low-dimensional structure that generates the data. Furthermore, our approach is carried out in the original data space, without any mapping of the data into a kernel feature space, as in the support vector-based methods. As a consequence, we can directly analyze the suspicious attributes once a failure has been detected.
Detecting and localizing failures promptly is crucial to mission-critical information systems. However, some specific features of those systems introduce challenges for the detection task. For instance, a large percentage of actual failures in computing systems are partial failures, which only break down part of the service functions and do not affect operational statistics such as response time. Such partial failures cannot easily be detected by traditional tools such as pings and heartbeats [2]. To solve this issue, statistical learning approaches have recently received a lot of attention due to their capability of mining large quantities of measurements for interesting patterns that can directly be related to high-level system behavior. For instance, Ide and Kashima [20] treated the Web-based system as a weighted graph and applied graph mining techniques to monitor the graph sequences for failure detection. In the Magpie project [5], Barham et al. used a stochastic context-free grammar to model the requests' control flow across multiple machines for detecting component failures and localizing performance bottlenecks. The Pinpoint project [9], a close relative of Magpie, proposed two features for system failure
detection: request path shapes and component interactions. For the former, the set of observed traces was modeled with a probabilistic context-free grammar (PCFG). The latter feature was used to build a profile for each component's interactions, with the χ² test comparing the current distribution against the profile. In the same context of request shape analysis as in [9], Jiang et al. [21] put forward a multiresolution abnormal trace detection algorithm using variable-length n-grams and automata. Bodik et al. [6] made use of user access behavior as evidence of system health and applied several statistical approaches, such as Naive Bayes, to mine such information for detecting application-level failures.
We notice that in most circumstances, the attributes collected from information systems are highly correlated. For instance, the interaction profile of one component, as described in [9], is usually correlated with those of other components due to implicit business logic or other system constraints. Similarly, there exist high correlations between different user Web access behaviors [6]. From this perspective, we believe that the high-dimensional measurements can be summarized in a space with much lower dimension. We explore such an underlying low-dimensional structure to detect system failures. Based on the learned structure of the data, the values of Hotelling T² and SPE are calculated for each sample during online monitoring. A failure is then detected based on the deviation of those statistics with respect to their distributions. Note that the Hotelling T² and SPE have already been used in chemometrics to understand chemical systems and processes [30], [33], [24]. For example, Soft Independent Modeling of Class Analogy (SIMCA), a widely used approach in chemometrics, employs the Hotelling T² and SPE to identify classes of system measurements with similar hidden phenomena [33]. Kourti and MacGregor [24] presented these statistics to monitor chemical processes and diagnose performance anomalies. However, those methods all assume that the data are located in a hyperplane in the original space and rely on linear methods such as principal component analysis (PCA) [23] or partial least squares (PLS) [18] to compute the values of Hotelling T² and SPE. In this paper, we provide a general framework for calculating these statistics, which considers both linear and nonlinear data generation models. For the nonlinear model, we propose a novel algorithm to extract the Hotelling T² and SPE from the underlying manifold learned from training data. Furthermore, we have applied the Hotelling T² and SPE statistics to failure detection in distributed computing systems. Experimental results show that these two statistics are effective in detecting a variety of injected failures in our Web-based test system.
3 MODELING THE NORMAL DATA
When system measurements contain hundreds of attributes,our solution for modeling their normal behavior is based on
the fact that actual measurements are usually generated from a structure with much lower dimension. Fig. 1 uses 2D examples to illustrate such situations. In Fig. 1a, the normal data (marked as "x") are generated from a 1D line with certain noise in the 2D space. Similarly, Fig. 1b shows data generated from an underlying 1D nonlinear manifold. Two abnormal measurements are also plotted in each figure (marked as "o"). As we can see, the abnormal sample o1 deviates from the underlying structure. Although the abnormal sample o2 may be located in the structure, its position in the structure is too far from those of the normal measurements.
It is observed in Fig. 1 that the implicit structure captures important information about the data distribution. Such a property can be exploited to distinguish abnormal from normal samples. That is, we discover the underlying geometric model and use data variations within and outside the model as features to detect failures. Two statistics, the Hotelling T² and SPE, are utilized to represent such data variations. We calculate these two statistics for each measurement and then build their probabilistic distribution based on the computed values. In the monitoring process, we compute the statistics of new measurements and check their values with respect to the learned distribution to detect failures. Fig. 2 provides the workflow of our normal data modeling. We consider both linear and nonlinear data generation models, which are described in Sections 3.1 and 3.2, respectively. Section 3.3 then presents a criterion to determine whether the linear or nonlinear model is suitable for the available data.
3.1 Linear Model
If the measurements x_i ∈ R^p, with i = 1, …, n, are generated from a low-dimensional hyperplane in R^p, we apply the SVD of the data matrix X = [x_1 … x_n]:

$$X = U\Sigma V^\top = U_s\Sigma_s V_s^\top + U_n\Sigma_n V_n^\top, \qquad (1)$$

where

$$\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r, \sigma_{r+1}, \ldots, \sigma_\ell) \in \mathbb{R}^{p\times n},$$
Fig. 1. Two-dimensional data examples. (a) Linear case. (b) Nonlinear case.
Fig. 2. The workflow of normal data modeling.
and σ₁ ≥ … ≥ σ_r ≫ σ_{r+1} ≥ … ≥ σ_ℓ, with ℓ = min{p, n}. The two orthogonal matrices U and V are called the left and right eigenmatrices of X. Based on the magnitude of the singular values σ_i, the space of X is decomposed into signal and noise subspaces. The left r columns of U, U_s = [u_1, u_2, …, u_r], form the bases of the signal space, and Σ_s = diag(σ_1, …, σ_r). Any vector x ∈ R^p can be represented by the summation of two projection vectors from the two subspaces:

$$x = \bar{x} + \tilde{x}, \qquad (2)$$

where x̄ = U_s U_s^⊤ x is the projection of x on the hyperplane expressed in the original space, which represents the signal part of x, and x̃ contains the noise. Meanwhile, we can also obtain the low-dimensional representation of x in the signal subspace:

$$y = U_s^\top x. \qquad (3)$$

The vector y ∈ R^r is called the principal component vector of x, which represents the r-dimensional coordinates of x in the signal subspace. The ith element of y is called the ith principal component of x. Since the principal components are uncorrelated and the variance of the ith component y_i is σ_i² [16], the covariance of y is C_y = diag(σ_1², …, σ_r²). Note that for ease of explanation, we assume that the data x_i are centered. In real situations, we need to center the data before the above calculations.
Two statistics are defined for each sample x to represent its variations within and outside the signal subspace. One is the Hotelling T² score [23], which is expressed as the Mahalanobis distance from x's principal component vector y to the mean of the principal component vectors of the training data:

$$T^2 = (y - \bar{y})^\top C_y^{-1} (y - \bar{y}), \qquad (4)$$

where C_y^{-1} is the inverse of C_y. Since ȳ = 0, we can simplify (4) as T² = y^⊤C_y^{-1}y. The other statistic, the SPE [23], indicates how well the sample x conforms to the hyperplane, measured by the Euclidean distance between x and its projection x̄ on the hyperplane expressed in the original space:

$$SPE = \|\tilde{x}\|^2 = \|x - \bar{x}\|^2. \qquad (5)$$
The intuition of using these two statistics for failure detection is illustrated in Fig. 3, in which 2D normal samples, marked as "x", are generated from a line (a 1D subspace) with certain noise. Through subspace decomposition, we obtain the direction of the line (the signal subspace) and then project each sample x onto that line to get x̄. In this case, the Hotelling T² score represents the Mahalanobis distance from x̄ to the origin (0, 0). The value of SPE is the squared distance between x and x̄. Two abnormal samples, marked as "o", are also shown in Fig. 3a. Since the sample o1 is far from the line, its SPE is much larger than those of the other points. Although the sample o2 has a reasonable SPE value, its Hotelling T² score is very large, since its projection on the line is far from the cluster of projected normal samples. We plot the histograms of the Hotelling T² score and the SPE value for all the samples, as shown in Figs. 3b and 3c, respectively. Based on these histograms, we conclude that by defining suitable boundaries for normal samples on the extracted statistics, we can find abnormal measurements and hence detect failures.
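The linear T² and SPE are straightforward to compute with an off-the-shelf SVD. Below is a minimal NumPy sketch of the procedure; the synthetic line data, the outlier coordinates, and all variable names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D training data, as in Fig. 3: a 1-D line (signal) plus noise.
n = 500
t = rng.uniform(-1.0, 1.0, n)
X = np.outer(t, [2.0, 1.0]) + 0.05 * rng.normal(size=(n, 2))

mu = X.mean(axis=0)
Xc = X - mu                                   # center the data first

# SVD of the p x n data matrix; keep r = 1 leading direction as signal subspace.
U, s, _ = np.linalg.svd(Xc.T, full_matrices=False)
r = 1
Us = U[:, :r]
var = (s[:r] ** 2) / (n - 1)                  # variances of the principal components

def t2_and_spe(x):
    """Hotelling T^2 (Eq. 4, with y-bar = 0) and SPE (Eq. 5) for one sample."""
    xc = x - mu
    y = Us.T @ xc                             # principal components, Eq. (3)
    t2 = float(np.sum(y ** 2 / var))          # Mahalanobis distance in the subspace
    spe = float(np.sum((xc - Us @ y) ** 2))   # squared distance to the subspace
    return t2, spe

# An off-subspace outlier (like o1) yields a large SPE; an on-subspace but
# extreme sample (like o2) yields a large T^2.
t2_o1, spe_o1 = t2_and_spe(np.array([-1.0, 2.0]))
t2_o2, spe_o2 = t2_and_spe(np.array([6.0, 3.0]))
```

In practice, the detection thresholds would be read off the training histograms of T² and SPE, as in Figs. 3b and 3c.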
3.2 Nonlinear Model
When the measurements x_i are generated from a nonlinear structure, we still desire statistics that serve the same purpose as the Hotelling T² and SPE in the linear model. To derive the nonlinear version of these two statistics, our approach is based on the assumption that the data lies on a nonlinear (Riemannian) manifold. We discover the underlying manifold of the high-dimensional measurements and then define the corresponding statistics based on the geometric features of each sample with respect to the discovered manifold.

According to the original definitions of the Hotelling T² in (4) and the SPE in (5), we need the following information to get their nonlinear estimates:

- the low-dimensional embedding vector y of the original measurement x, where y represents the low-dimensional coordinates of x in the underlying manifold instead of the linear hyperplane, and
- the projection x̄ of the measurement x on the manifold in the original space.
If we have the values of these variables, the nonlinear versions of Hotelling T² and SPE for each sample x can be defined in the same form as in the linear situation. For instance, the nonlinear T² is expressed as in (4), except that y is computed from the underlying manifold, ȳ is the mean of y over all the training data, and C_y denotes the sample covariance matrix:

$$C_y = \frac{1}{n-1}\sum_{j=1}^{n} (y_j - \bar{y})(y_j - \bar{y})^\top. \qquad (6)$$
Fig. 3. The role of extracted statistics for failure detection. (a) The 2D normal data together with two outliers. (b) The histogram of Hotelling T² of all the samples. (c) The histogram of SPE of all the samples.
Similarly, the nonlinear SPE for sample x is also defined as in (5), with x̄ representing the projection of x on the manifold.

In the last few years, many manifold reconstruction algorithms have been proposed, such as LLE [32] and ISOMAP [35]. However, these algorithms all focus on the problem of dimension reduction, which only outputs the low-dimensional embedding vector y of the original measurement x and does not directly compute the projection x̄ of x on the manifold. Furthermore, in practical situations, the measurements are always noisy, and it has been noted that the LLE and ISOMAP algorithms are sensitive to noise [4], [7]. As a consequence, we use two steps to obtain the necessary estimates:
$$x_i \;\xrightarrow{(1)}\; \bar{x}_i \;\xrightarrow{(2)}\; y_i, \qquad i = 1, \ldots, n. \qquad (7)$$

We start by calculating the projection x̄_i of each sample x_i on the underlying manifold, followed by the estimation of the y_i's based on the projected variables. Note that here, we use the estimate x̄_i as an intermediary for obtaining y_i. The accuracy of the estimate x̄_i is therefore important to the computation of the two statistics. To get reliable x̄_i's, we first apply the linear EIV model, as described in Section 3.2.1, in the neighborhood region of each sample x_i to compute the locally smoothed value of every sample in that neighborhood. Since the local regions overlap, each sample usually has several locally smoothed values. We then present a fusion process in Section 3.2.2 to combine all the locally smoothed values of x_i and hence obtain its projection x̄_i on the manifold in the original space. Once the projections x̄_i are available, they are input to current manifold learning algorithms such as LLE to estimate the low-dimensional embedding vectors y_i, which is described in Section 3.2.3.
3.2.1 Local Error-in-Variables (EIV) Model
We start with a k-nearest-neighbor search for each sample x_i to define its local region. Given a set of p-dimensional vectors x_{i_1}, …, x_{i_k} located in the neighborhood of x_i, a local smoothing of those vectors is performed based on the geometry of the local region that comprises the points. For simplicity, we use x_1, …, x_k to denote the neighborhood points of x_i. In current manifold reconstruction algorithms such as LLE, the local geometry is represented by a weight vector w = [w_{i1}, …, w_{ik}]^⊤ that best reconstructs x_i from its neighbors. By minimizing the following reconstruction error

$$\epsilon = \Big\|x_i - \sum_{j=1}^{k} w_{ij} x_j\Big\|^2 = \sum_{l=1}^{p}\Big(x_{il} - \sum_{j=1}^{k} w_{ij} x_{jl}\Big)^2 \qquad (8)$$

subject to Σ_j w_{ij} = 1, where x_{il} (or x_{jl}) represents the lth element of the vector x_i (or x_j), the LLE obtains the least squares solution of w for the local region surrounding x_i. Equation (8) assumes that all the neighbors of x_i are free of noise and that only the measurement x_i is noisy. This is frequently unrealistic, since usually all the samples are corrupted by noise. As a result, the solution of (8) is biased. To remedy this problem, we use the following EIV model [14] by minimizing
$$\epsilon_0 = \Big\|x_i - \sum_{j=1}^{k} w_{ij}\hat{x}_j\Big\|^2 + \sum_{j=1}^{k}\|x_j - \hat{x}_j\|^2 \qquad (9)$$

subject to Σ_j w_{ij} = 1, where x̂_j is the local noise-free estimate of sample x_j in the region surrounding x_i.
Equation (9) can also be represented as

$$\epsilon_0 = \|x_i - \hat{x}_i\|^2 + \sum_{j=1}^{k}\|x_j - \hat{x}_j\|^2 \qquad (10)$$

by taking into account that x̂_i = Σ_j w_{ij} x̂_j. Define the matrices B = [x_i x_1 x_2 … x_k] ∈ R^{p×(k+1)} and E = [x̂_i x̂_1 x̂_2 … x̂_k] ∈ R^{p×(k+1)} to contain the locally smoothed estimates of those points, with p ≫ k for common cases where the nonlinear manifold is embedded in a high-dimensional space. The problem (10) can be reformulated as
$$\min \|B - E\|^2 \qquad (11)$$

subject to

$$E\theta = 0, \qquad (12)$$

where θ = [1, −w_{i1}, −w_{i2}, …, −w_{ik}]^⊤. From (12), the rank of E is k. Therefore, the estimate Ê is the rank-k approximation of the matrix B. If the SVD of B is B = Σ_{j=1}^{k+1} σ_j u_j v_j^⊤, with σ_1 ≥ σ_2 ≥ … ≥ σ_{k+1}, we obtain the noise-free sample matrix Ê from the Eckart-Young-Mirsky theorem [12], [29] as

$$\hat{E} = \sum_{j=1}^{k} \sigma_j u_j v_j^\top. \qquad (13)$$
The weight vector w can also be estimated from the SVD of the matrix B. Since our purpose here is to find the local noise-free estimate Ê, we do not discuss it in detail. For an in-depth treatment of the EIV model and its solutions, see [14], [36]. According to [36], we can also obtain a first-order approximation of the covariance C_w of the parameter w, which is proportional to the estimate of the noise variance σ̂² = σ_{k+1}²/(p − k).
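The rank-k smoothing of Eq. (13) is a truncated SVD of the local matrix B. The NumPy sketch below applies it to one synthetic local region; the data, noise level, and variable names are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# One local region: a point x_i and its k neighbors, noisy samples from a
# locally linear 1-D structure embedded in R^p.
p, k = 8, 4
direction = rng.normal(size=p)
direction /= np.linalg.norm(direction)
coords = rng.uniform(0.0, 1.0, k + 1)
clean = np.outer(coords, direction)                 # true points on the structure
noisy = clean + 0.01 * rng.normal(size=(k + 1, p))

# B = [x_i x_1 ... x_k] in R^{p x (k+1)}; its rank-k truncation (Eckart-Young-
# Mirsky) gives the locally smoothed estimates E-hat of Eq. (13).
B = noisy.T
U, s, Vt = np.linalg.svd(B, full_matrices=False)
E_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]              # rank-k approximation of B

# Dropping the smallest singular component removes mostly noise energy,
# so the smoothed points sit closer to the clean structure.
err_raw = np.linalg.norm(noisy - clean)
err_smooth = np.linalg.norm(E_hat.T - clean)
```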
3.2.2 Fusion
We apply the EIV model in the neighborhood region of every sample x_i to obtain the locally smoothed value of every point in that region. Since each sample x_i is usually included in the neighborhoods of several other points as well as in its own local region, it has more than one local noise-free estimate. Given those different estimates {x̂_i^(1), x̂_i^(2), …, x̂_i^(h)}, with h ≥ 1, our goal is to find a global noise-free estimate x̄_i of x_i from its many local values. Such a global noise-free estimate x̄_i can be regarded as the projection of x_i onto the manifold in the original space.
Due to the variation of the curvature of the underlying manifold, the linear model presented in Section 3.2.1 may not always succeed in discovering local structures. For instance, in a local region with a large curvature, the noise-free estimate x̂_i^(j) is not reliable. Suppose we have the covariance matrix C_j of x̂_i^(j) to characterize the uncertainty of the
linear fitting through which x̂_i^(j) is obtained. Then, the global estimate of the noise-free value x̄_i can be found by minimizing the sum of the following Mahalanobis distances:

$$\bar{x}_i = \arg\min_{x} \sum_{j=1}^{h} \big(x - \hat{x}_i^{(j)}\big)^\top C_j^{-1} \big(x - \hat{x}_i^{(j)}\big). \qquad (14)$$

The solution of (14) is

$$\bar{x}_i = \Big(\sum_{j=1}^{h} C_j^{-1}\Big)^{-1} \sum_{j=1}^{h} C_j^{-1}\hat{x}_i^{(j)}. \qquad (15)$$

That is, the global estimate x̄_i is the covariance-weighted average of the x̂_i^(j)'s. The more uncertain a local estimate x̂_i^(j) is (the inverse of its covariance has a smaller norm), the less it contributes to the global value x̄_i.

The covariance C_j can be estimated from the error propagation of the covariance of w during the calculation of x̂_i^(j). However, since the dimension of x̂_i^(j) is high, it is not feasible to directly use C_j. Instead, we use the determinant of C_j to approximate (15):

$$\bar{x}_i = \Big(\sum_{j=1}^{h} |C_j|^{-1}\Big)^{-1} \sum_{j=1}^{h} |C_j|^{-1}\hat{x}_i^{(j)}, \qquad (16)$$

where |C_j| is approximated by |C_j| ≈ κσ̂_j², and κ is a constant that does not contribute to the calculation of (16).
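The approximate fusion rule of Eq. (16) is simply a determinant-weighted average of the local estimates. A small sketch follows; the helper name `fuse` and the toy estimates and determinants are invented for illustration.

```python
import numpy as np

def fuse(estimates, dets):
    """Determinant-weighted fusion of local estimates, Eq. (16).

    estimates: (h, p) array of local noise-free estimates of one point.
    dets:      (h,) array of |C_j|, the determinants of their covariances.
    """
    w = 1.0 / np.asarray(dets, dtype=float)   # weights |C_j|^{-1}
    w = w / w.sum()                           # normalize, as in Eq. (16)
    return w @ np.asarray(estimates, dtype=float)

ests = np.array([[1.0, 0.0],
                 [3.0, 0.0]])
fused = fuse(ests, dets=[1.0, 1.0])           # equal trust: the midpoint
fused_skewed = fuse(ests, dets=[0.1, 10.0])   # trust the first estimate more
```

A more uncertain local estimate (larger |C_j|) gets a smaller weight, matching the discussion after Eq. (15).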
3.2.3 Manifold Reconstruction
Once we have the estimates x̄_i, we feed them to current manifold reconstruction algorithms such as LLE and ISOMAP to obtain the low-dimensional embedding vectors y_i. In our approach, the EIV modeling and the fusion process actually serve as preprocessing procedures for those manifold learning algorithms in order to achieve a more robust and accurate reconstruction of the nonlinear manifold. One by-product is that we also obtain the projection of each measurement on the manifold in the original space. We do not describe the LLE or ISOMAP algorithm here; readers may refer to [32], [35] for details.

When the projections x̄_i and the low-dimensional embedding vectors y_i are available, the nonlinear Hotelling T² and SPE of the x_i's are calculated in the same way as in (4) and (5). We then use a Gaussian mixture model to estimate the density distribution of the two statistics. Section 4 will show that by monitoring these statistics along time, we can detect abnormal points that deviate from the underlying manifold.
3.3 Linear versus Nonlinear
The linear model is easy to understand and simple to implement. On the other hand, nonlinear models are more accurate when the nonlinearities of the underlying structure are too pronounced to be approximated by linear models. To make the correct choice of model, given a set of measurements, we first estimate the intrinsic dimension of the data and then apply a statistical test to decide whether the linear or nonlinear model should be used.
We use the method proposed in [17] to estimate the intrinsic dimension. It is based on the observation that for an r-dimensional data set embedded in the p-dimensional space, x_i ∈ R^p, with i = 1, …, n, the number of pairs of points closer to each other than ρ is proportional to ρ^r. We define

$$H_n(\rho) = \frac{2}{n(n-1)} \sum_{i<j} I\big(\|x_i - x_j\| < \rho\big),$$

where I(·) is the indicator function.
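The pair-counting statistic is easy to approximate numerically: if H_n(ρ) ∝ ρ^r for small ρ, the log-log slope between two radii estimates the intrinsic dimension r. The sketch below uses a 1-D line embedded in R³ as illustrative data; the function name, radii, and data are assumptions for this example only.

```python
import numpy as np

rng = np.random.default_rng(2)

def h_n(X, rho):
    """Fraction of sample pairs closer than rho (the correlation sum H_n)."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    close = np.sum(d < rho) - n            # drop the n self-pairs on the diagonal
    return close / (n * (n - 1))           # ordered pairs: matches 2/(n(n-1)) sum_{i<j}

# Data on a 1-D line embedded in R^3, so H_n(rho) ~ rho^1 for small rho.
t = rng.uniform(0.0, 1.0, 400)
X = np.outer(t, [1.0, 2.0, 2.0])
rho1, rho2 = 0.05, 0.1
slope = np.log(h_n(X, rho2) / h_n(X, rho1)) / np.log(rho2 / rho1)
# slope is an estimate of the intrinsic dimension r, here close to 1
```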
of variations in system behavior. To deal with such systems with changing behavior, Section 4.2 provides another solution to detect failures. We employ an online updating algorithm to sequentially update the distribution of the two statistics for every sample in the monitoring process. A failure is then detected based on the statistical deviation of the density distribution before and after the update.

For the linear model, it is easy to compute the Hotelling T² and SPE for each new sample x_t based on (4) and (5). However, calculating those statistics online is not straightforward in the nonlinear case, since there is no global formula or parameter set to describe the nonlinear manifold. Therefore, Section 4.1 focuses on the online computation of the statistics for the nonlinear model. Once the newly computed statistics are available, Section 4.2 then presents the way of updating their density and detecting anomalies based on the updated density.
4.1 Online Computing of Two Statistics
Fig. 4 presents the algorithm for the online computation of the Hotelling T2 and SPE for the nonlinear model. Given a new measurement x_t, the first step is to find the nearest patch of x_t on the manifold discovered from the training data. We start by locating the nearest point x* of x_t in the training data, together with its nearest neighbors (including x* itself) x_1, x_2, …, x_k, and then retrieve their projections on the manifold x̂_1, …, x̂_k. The plane spanned by the x̂_i's is then regarded as the nearest patch of x_t on the manifold. Note that the nearest neighbors of x* and their projections have already been calculated during the training phase, so no extra calculations are needed. The only issue is finding the nearest point x* of x_t in the training data. For high-dimensional x_t, the complexity of the nearest neighbor query is practically linear in the training data size. To speed up the computation, a locality-sensitive hashing (LSH) [15] data structure is constructed over the training data to approximate the nearest neighbor search. As a consequence, the nearest neighbor query time has sublinear dependence on the data size.
Step 2 in Fig. 4 then projects x_t onto its nearest patch. We assume that the patch spanned by the x̂_i's is linear and build a matrix X̂ = [x̂_1, …, x̂_k]. By solving the equation X̂w = x_t in the least squares sense, we get the estimate ŵ = (X̂⊤X̂)⁻¹X̂⊤x_t and the projection of x_t on the manifold:

$$\hat{\mathbf{x}}_t = \hat{X}\hat{\mathbf{w}} = \hat{X}\left(\hat{X}^\top \hat{X}\right)^{-1}\hat{X}^\top \mathbf{x}_t. \qquad (22)$$
Once we obtain the weight ŵ, the low-dimensional embedding vector y_t of x_t is calculated by (21), based on the observation that the local geometry in the original space should be equally valid for local patches in the manifold. Accordingly, the Hotelling T2 and SPE values of the new sample x_t are calculated.
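The patch projection of Step 2 is an ordinary least-squares problem. A minimal NumPy sketch, with randomly generated stand-ins for the k neighbor projections and the new measurement:

```python
import numpy as np

# Step 2 of Fig. 4 as ordinary least squares. Xhat's columns stand in for the
# k neighbor projections on the manifold; x_t is a new measurement (both random here).
rng = np.random.default_rng(1)
Xhat = rng.standard_normal((100, 12))   # k = 12 projected neighbors in R^100
x_t = rng.standard_normal(100)          # new measurement

# w_hat = (Xhat^T Xhat)^{-1} Xhat^T x_t, solved stably with lstsq
w_hat, *_ = np.linalg.lstsq(Xhat, x_t, rcond=None)
x_t_hat = Xhat @ w_hat                  # projection of x_t onto the patch

residual = x_t - x_t_hat
spe = float(residual @ residual)        # SPE = squared norm of the residual
# the least-squares residual is orthogonal to the patch
print(np.allclose(Xhat.T @ residual, 0, atol=1e-8))  # True
```

Solving with `lstsq` rather than forming (X̂⊤X̂)⁻¹ explicitly is the numerically safer route when the neighbor projections are nearly collinear.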
4.2 Density-Based Detection
Once the newly calculated Hotelling T2 and SPE are available, we use the sequentially discounting expectation-maximization (SDEM) algorithm [26], as described in Fig. 5, to update the density (20). Built on the original expectation-maximization (EM) algorithm for Gaussian mixture models [31], SDEM utilizes an exponentially weighted moving average (EWMA) filter to adapt to frequent system changes. For instance, given a set of observations {x_1, x_2, …, x_n, …}, an online EWMA filter of the mean is expressed as

$$\mu_{n+1} = (1 - \alpha)\mu_n + \alpha x_{n+1}, \qquad (23)$$

where the forgetting parameter α dictates the degree of discounting of previous examples. Intuitively, the larger α is, the faster the algorithm can age out past examples. Note that there is another parameter in Fig. 5, which is set between [1.0, 2.0] and used in the estimation of the mixture components in order to improve the stability of the solution.
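The EWMA update in (23) is a one-line recursion; a small illustration (the observations and the choice α = 0.5 are arbitrary):

```python
def ewma_update(mu, x_new, alpha):
    # Eq. (23): mu_{n+1} = (1 - alpha) * mu_n + alpha * x_{n+1}
    return (1 - alpha) * mu + alpha * x_new

# with alpha = 0.5 the estimate converges quickly toward recent observations
mu = 0.0
for x in [1.0, 1.0, 1.0, 1.0]:
    mu = ewma_update(mu, x, alpha=0.5)
print(mu)  # 0.9375
```

After four observations of 1.0, the residual weight on the initial value is (1 − α)⁴ = 0.0625, which is exactly the gap to 1.0 above.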
An anomaly is then determined based on the statistical deviation of the density distribution before and after the new statistics z_t are obtained. If we denote the two distributions as p^{t−1}(z) and p^t(z), respectively, our metric, called the Hellinger score, is defined by

$$s_H(\mathbf{z}_t) = \int \left( \sqrt{p^t(z)} - \sqrt{p^{t-1}(z)} \right)^2 dz. \qquad (24)$$
Intuitively, this score measures how much the probability density function p^t(z) has moved from p^{t−1}(z) after learning z_t. A higher score indicates that z_t is an outlier with high probability. For the efficient computation of the Hellinger score, see [26].
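Assuming the density (20) is a Gaussian mixture over the statistics, the score (24) can be approximated by numerical integration on a grid. A sketch with hypothetical 1D mixtures (the paper's actual computation follows the efficient scheme in [26]):

```python
import numpy as np

def gmm_pdf(z, weights, means, sds):
    # 1D Gaussian mixture density evaluated on a grid z
    z = np.asarray(z)[:, None]
    comps = np.exp(-0.5 * ((z - means) / sds) ** 2) / (sds * np.sqrt(2 * np.pi))
    return comps @ weights

def hellinger_score(p, q, dz):
    # Eq. (24): integral of (sqrt(p) - sqrt(q))^2, approximated on the grid
    return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dz)

z = np.linspace(-10, 10, 4001)
dz = z[1] - z[0]
w = np.array([0.5, 0.5])
p_before = gmm_pdf(z, w, np.array([0.0, 3.0]), np.array([1.0, 1.0]))
p_after = gmm_pdf(z, w, np.array([0.5, 3.0]), np.array([1.0, 1.0]))  # one mean shifted

s_same = hellinger_score(p_before, p_before, dz)   # no density movement
s_shift = hellinger_score(p_before, p_after, dz)   # density has moved
print(s_same, s_shift > s_same)
```

An unchanged density scores exactly zero, and any movement of the mixture after an SDEM update yields a positive score, bounded above by 2.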
CHEN ET AL.: MONITORING HIGH-DIMENSIONAL DATA FOR FAILURE DETECTION AND LOCALIZATION IN LARGE-SCALE COMPUTING... 19
Fig. 4. Online computing of Hotelling T2 and SPE for the nonlinear model.

Fig. 5. The SDEM algorithm for updating the mixing probability c_i^t, the mean μ_i^t, and the covariance Σ_i^t of the k Gaussian functions in (20), given the new statistics z_t.
Authorized licensed use limited to: NEC Labs. Downloaded on May 4, 2009 at 18:58 from IEEE Xplore. Restrictions apply.
5 FAILURE LOCALIZATION
This section discusses finding a set of the most suspicious attributes, after a failure has been detected, based on their relation to the failure. Although these returned attributes may not reveal the exact root cause of the failure, they still contain useful clues and can thereby greatly help system operators narrow down the debugging scope to a few highly probable components.

We collect and analyze the failure measurement x_f to determine which of its statistics has gone wrong by comparing each statistic with its marginal distribution derived from the joint density (20). If the statistic SPE deviates from its distribution, we use the ranking method in Section 5.1 to return the most suspicious variables. Similarly, if the Hotelling T2 goes wrong, we use the method in Section 5.2 to rank the variables. If both statistics go wrong, the union of the two sets of most suspicious variables is returned.
5.1 Variable Ranking from SPE
Given the failure measurement x_f, according to (5), its SPE represents the squared norm of the residual vector x̃_f = x_f − x̂_f. Our method for finding the most suspicious attributes from the deviated SPE is based on the absolute value of each element in the residual x̃_f. If the ith element of x̃_f has a large absolute value, then the ith attribute is one of the most suspicious attributes.

However, directly comparing the elements of x̃_f is not reliable, since the attributes have different scales. In order to prevent one attribute from being outweighed by the others, we record the absolute values of each element in the residual vectors of the training samples and calculate their mean m̃ and standard deviation σ̃. The ith element of x̃_f is then transformed to a new variable:
$$s_i = \frac{|\tilde{x}_{f,i}| - \tilde{m}_i}{\tilde{\sigma}_i}, \qquad (25)$$

where m̃_i and σ̃_i represent the ith elements of m̃ and σ̃, respectively. A large s_i value indicates the high importance of the ith attribute to the detected failure.
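The ranking in (25) can be sketched as follows; the residuals are synthetic stand-ins, with one attribute deliberately made abnormal:

```python
import numpy as np

# Synthetic stand-ins: training residuals (rows) with per-attribute scales,
# and one failure residual in which attribute 4 is deliberately abnormal.
rng = np.random.default_rng(2)
scales = np.array([0.1, 1.0, 0.5, 2.0, 0.2])
train_resid = rng.standard_normal((500, 5)) * scales
fail_resid = np.array([0.05, 1.2, 0.3, 1.5, 2.0])

m = np.abs(train_resid).mean(axis=0)   # mean of |residual| per attribute
s = np.abs(train_resid).std(axis=0)    # standard deviation of |residual|

scores = (np.abs(fail_resid) - m) / s  # eq. (25), one score per attribute
print(int(np.argmax(scores)))          # 4 -- the most suspicious attribute
```

Note that attribute 3 has a larger raw residual scale than attribute 4, yet scores near zero: the standardization is exactly what stops large-scale attributes from masking the truly anomalous one.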
5.2 Variable Ranking from Hotelling T2
For the linear model, the Hotelling T2 in (4) can be simplified as

$$T^2 = \mathbf{y}^\top C_y^{-1} \mathbf{y} = \sum_{i=1}^{r} \frac{y_i^2}{\sigma_i^2}.$$

We define

$$\mathbf{v}_t = \left( \frac{y_1^2}{\sigma_1^2}, \ldots, \frac{y_r^2}{\sigma_r^2} \right)^\top, \qquad (26)$$

in which the ith element represents the importance of the ith principal component y_i to the T2 statistic. However, the principal components usually do not have any physical meaning. To reveal failure evidence at the attribute level, we compute the contribution of the original variable x_i to each principal component in terms of extracted variances and denote it v_i. We expect that the variable whose v_i has the same element distribution as v_t is the most suspicious. Therefore, we compute

$$t_i = \mathbf{v}_i^\top \mathbf{v}_t \qquad (27)$$

for the ith attribute and reveal the suspicious attributes based on the t_i's. If the value of t_i is large, then the ith attribute needs more attention in the debugging.
In order to calculate v_i for each variable x_i, we consider the principal component loading matrix L ∈ R^{p×r} [3], in which each element l_ij represents the correlation between the ith variable x_i and the jth principal component y_j:

$$l_{ij} = E(x_i y_j). \qquad (28)$$

The matrix L can be computed from the SVD of the data matrix X [3], and the square of its element, l_{ij}^2, gives the proportion of variance in the original variable x_i explained by the principal component y_j. Therefore, we define a matrix M with elements M_ij = l_{ij}^2 var(x_i) to represent the actual variance in x_i that is explained by y_j. The sum of the jth column of M, f_j = Σ_i M_ij, represents the total variance extracted by the jth principal component y_j. We divide each element M_ij by f_j to obtain a new matrix M̃ with elements M̃_ij = M_ij / f_j. The ith row of matrix M̃ then represents the contribution of the original variable x_i to each principal component in terms of extracted variances:

$$\mathbf{v}_i = \left( \tilde{M}_{i1}, \ldots, \tilde{M}_{ir} \right)^\top. \qquad (29)$$
For the nonlinear model, however, the low-dimensional embedding vector y cannot be expressed as a linear combination of the original attributes. In order to apply the above variable ranking method to nonlinear situations, we perform SVD on the nearest patch of x_f, X̂ = [x̂_1, …, x̂_k], as discovered in Section 4.1. By doing so, we obtain x_f's local principal components y_f. The loading matrix L and, hence, the v_i's are then calculated from the data matrix X̂ and y_f, and v_t is computed from y_f. Accordingly, the t_i values in (27) can be calculated to rank the variables.
6 EXPERIMENTAL RESULTS
In this section, we first use synthetic data to demonstrate the advantages of the local EIV modeling and fusion process, as described in Sections 3.2.1 and 3.2.2, in achieving a more accurate reconstruction of the nonlinear manifold. We also evaluate both the linear and nonlinear methods in detecting a set of generated outliers. Then, we apply our proposed high-dimensional data monitoring approach to a real J2EE-based application to detect a variety of injected system failures.
6.1 Synthetic Data
In addition to providing an original framework for monitoring high-dimensional measurements in information systems, this paper also makes novel algorithmic contributions to nonlinear manifold recovery. We have proposed the local EIV model and a fusion process to reduce the noise in the original measurements and hence achieve a more accurate reconstruction of the underlying manifold. In Section 6.1.1, we use synthetic data to demonstrate the effectiveness of our proposed algorithms. In addition, we generate some outliers to evaluate the detection performance of both the linear and nonlinear models in Section 6.1.2.
6.1.1 Manifold Reconstruction
For ease of visualization, a 1D manifold (curve) is used in this example. The 400 data points are generated by g(t) = (t cos t, t sin t)⊤ with a certain amount of Gaussian noise added, where t is uniformly sampled in the interval [0, 4π].

20 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 1, JANUARY 2008
Fig. 6a shows such a curve under noise with standard deviation 0.3. Since in this example the data dimension p = 2 is smaller than the neighborhood size k (= 12 for this data set), we use a regularized solution of the EIV model [13] to calculate the matrix E in (13).

The reconstruction accuracy for the 1D manifold is measured by the relationship between the recovered manifold coordinate τ̃ and the centered arc length Λ(t), defined as

$$\Lambda(t) = \int_{t_0}^{t} \|J_g(t)\| \, dt, \qquad (30)$$

where J_g(t) is the Jacobian of g(t):

$$J_g(t) = (\cos t - t \sin t,\; \sin t + t \cos t)^\top. \qquad (31)$$
The more accurate the manifold reconstruction is, the more linear the relationship between τ̃ and Λ becomes. Figs. 6b, 6c, and 6d show the relationship curves generated by the LLE algorithm, LLE with EIV modeling, and LLE with both EIV modeling and fusion. It is obvious that the LLE with both EIV modeling and fusion outperforms the other two algorithms. Note that in the LLE algorithm with only EIV modeling, the estimate of the projection x̂ is taken as the locally smoothed value from the region that surrounds x.
Further comparison of the performance of the three algorithms is carried out by random simulations on manifolds in the 100-dimensional space. We generate the same 2D data g(t_i) as in the first example, and then transform the data into 100D vectors by an orthogonal transformation x_i = Q g(t_i), where Q ∈ R^{100×2} is a random orthonormal matrix. We add different levels of noise to the data, with the standard deviation ranging from 0.1 to 1. At each noise level, 100 trials are run. We use the correlation coefficient between the recovered manifold coordinate τ̃ and the centered arc length Λ,

$$\rho = \frac{\mathrm{cov}(\tilde{\tau}, \Lambda)}{\sqrt{\mathrm{var}(\tilde{\tau}) \, \mathrm{var}(\Lambda)}}, \qquad (32)$$
to measure the strength of their linear relationship. Fig. 7 shows the mean and standard deviation of the correlation coefficient ρ obtained by LLE, LLE with EIV modeling, and LLE with both EIV modeling and fusion, over 100 trials at each noise level, with the standard deviation ranging from 0.1 to 1. Note that the vertical bars in this figure mark one standard deviation from the mean. The figure illustrates that both the EIV modeling and the fusion are beneficial in reducing the noise of the data samples.
6.1.2 Outlier Detection
We use data from a highly nonlinear structure to illustrate the effectiveness of the manifold reconstruction and the two statistics in detecting outliers. We generate 1,000 3D data points z_i from a 2D Swiss roll, which is shown in Fig. 8a. Those 3D data are then transformed into 100-dimensional vectors x_i by an orthogonal transformation x_i = Q z_i, where Q ∈ R^{100×3}
Fig. 6. (a) Samples from a noisy 1D manifold. (b) The centered arc length Λ versus the manifold coordinate τ̃ recovered by LLE. (c) The centered arc length Λ versus the manifold coordinate τ̃ recovered by LLE with EIV modeling. (d) The centered arc length Λ versus the manifold coordinate τ̃ recovered by LLE with both EIV modeling and fusion.
Fig. 7. Performance comparison of LLE, LLE with EIV, and LLE with EIV
and fusion in noise reduction. The vertical bars mark one standard
deviation from the mean.
Fig. 8. The comparison of linear and nonlinear models in outlier detection. (a) The 3D view of the training data. (b) The 3D view of the test samples (marked as ) and the training data (marked as "."). (c) The scatterplot of the SPE and T2 statistics of the test samples computed by the linear method. The normal test samples are marked as "." and the outliers are marked as "x". (d) The scatterplot of the SPE and T2 statistics of the test samples computed by the nonlinear method. The normal test samples are marked as "." and the outliers are marked as "x".
is a random orthonormal matrix. Gaussian noise with standard deviation 0.5 is also added to the 100D vectors. Meanwhile, we also generate 200 test samples, half of which are normal data generated in the same way as the training data; the others are randomly generated outliers. Fig. 8b presents a 3D view of the test data (marked as ) and the training data.
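A random orthonormal embedding of this kind can be generated via the QR decomposition of a Gaussian matrix; a sketch (the Swiss roll parameterization below is one common choice, not necessarily the exact one used here):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
# one common Swiss roll parameterization lifted to 3D
theta = rng.uniform(1.5 * np.pi, 4.5 * np.pi, n)
height = rng.uniform(0.0, 20.0, n)
Z = np.column_stack([theta * np.cos(theta), height, theta * np.sin(theta)])

# random orthonormal Q in R^{100x3} from the QR decomposition of a Gaussian matrix
Q, _ = np.linalg.qr(rng.standard_normal((100, 3)))
X = Z @ Q.T + 0.5 * rng.standard_normal((n, 100))  # embed, then add noise

# Q has orthonormal columns, so pairwise distances are preserved up to the noise
print(np.allclose(Q.T @ Q, np.eye(3)), X.shape)
```

Because Q⊤Q = I, the embedding is an isometry of the 3D structure into the 100-dimensional space, so the intrinsic geometry the manifold learner must recover is unchanged by the lift.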
We apply both linear PCA and the proposed manifold reconstruction algorithms to recover the underlying structure from the training data. Based on the learned structure, the T2 and SPE statistics are computed for the training data and the test samples. Figs. 8c and 8d present the scatterplots of the two statistics of the test data computed from the linear and nonlinear methods, respectively, in which the normal data are plotted as "." and the outliers are marked as "x". In Fig. 8c, we see that the normal samples and outliers are highly overlapped. The linear method shows poor performance in detecting the outliers because of the high nonlinearity of the data structure. On the other hand, the results from the nonlinear method present a clear separation between normal samples and outliers, as shown in Fig. 8d. In this figure, we notice that the SPE statistic plays an important role in distinguishing the outliers from the normal data. This is reasonable, because the outliers are randomly generated, and it is unlikely for them to lie on the same manifold structure as the inliers.

However, the T2 statistic, which measures the in-model distances among samples, is also useful for outlier identification. As shown in Fig. 8d, two outliers, (8.7, 86.9) and (8.4, 108.1), exhibit small SPE values but large T2 scores. We check those two points in the original 3D data and find that they actually lie on the manifold of the inliers but far from the cluster of the inlier data. In this case, the T2 statistic provides good evidence of the existence of outliers.
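For reference, the linear (PCA-based) T2 and SPE statistics used in this comparison can be sketched as follows; the data and the choice r = 3 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))  # correlated data
mean = X.mean(axis=0)
Xc = X - mean

r = 3                                   # number of retained components
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:r].T                            # p x r matrix of principal directions
lam = (S[:r] ** 2) / (len(X) - 1)       # variances of the retained components

def t2_spe(x):
    # T^2 measures variation inside the model; SPE the residual outside it
    y = P.T @ (x - mean)
    t2 = float(np.sum(y ** 2 / lam))
    resid = (x - mean) - P @ y
    return t2, float(resid @ resid)

t2, spe = t2_spe(X[0])
print(t2 >= 0.0 and spe >= 0.0)  # True
```

The two numbers split a sample's deviation into an in-model part (T2, scaled by each component's variance) and an out-of-model part (SPE), which is exactly the decomposition both detectors score.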
6.2 Real Test Bed
Our algorithms have been tested on a real e-commerce application based on the J2EE multitiered architecture. J2EE is a widely adopted platform standard for constructing enterprise applications from deployable Java components called Enterprise Java Beans (EJBs). The architecture of our testbed system is shown in Fig. 9. We use Apache as the Web server. The application server consists of the Web container (Tomcat) and the EJB container (JBoss). MySQL runs at the back end to provide persistent storage of data. PetStore 1.3.2 is deployed as our testbed application. Its functionality consists of a storefront, shopping cart, purchase tracking, and so on. There are 47 components in PetStore, including EJBs, Servlets, and JSPs. We built a client emulator to generate a workload similar to that created by typical user behavior. The emulator produces a varying number of concurrent client connections, with each client simulating a session based on common scenarios consisting of a series of requests such as creating new accounts, searching by keywords, browsing item details, updating user profiles, placing orders, and checking out.
The monitored data are collected from the three servers (Web server, application server, and database server) in our testbed system. Each server generates measurements from a variety of sources such as the CPU, disk, network, and operating system. Fig. 10 lists all these attributes, which are divided into eight categories. The three right columns of the figure give the number of attributes in each category generated by the three servers, respectively. In total, each measurement contains 111 attributes. We manually checked these attributes and observed that many of them are correlated. Fig. 11 presents an example of four highly correlated attributes, which suggests that our proposed approach is applicable to this type of data.
We collect the measurements every 5 seconds under normal system operation, with the magnitude of the workload randomly generated between 0 and 100 user requests per second. In total, 5,000 data samples are gathered as the training data. To determine whether the linear or the nonlinear model best characterizes the data set, we calculate H_n(ε_i) for different ε_i's, as described in Section 3.3, fit a line between their log values, log H_n(ε_i) = 5.83 log ε_i − 15.14, and
Fig. 9. The architecture of the testbed system.
Fig. 10. The list of attributes from the testbed.
Fig. 11. Example of four correlated attributes.
get the intrinsic dimension r = ⌈5.83⌉ = 6. Since the calculated ρ = 0.92 from (19) is smaller than the threshold 0.98, it is suggested that the nonlinear model be applied to the data. In the following, we confirm this conclusion by comparing the performance of the linear and nonlinear models in detecting a variety of failures injected into the system.
We modify the code in some EJB components of the PetStore application to simulate a number of real system failures. Five types of faults are injected into various components, with different intensities, to demonstrate the robustness of our approach.
Memory Leaking. We simulate three memory leaking failures by repeatedly allocating three different sizes (1 Kbyte, 10 Kbytes, and 100 Kbytes) of heap memory in the ShoppingCartEJB of the PetStore application. Since the reference to that EJB object is always held by other objects, the Java garbage collector does not reclaim the leaked memory. Hence, the PetStore application gradually exhausts its supply of virtual memory pages, which leads to severe performance issues and makes the completion of client requests much slower.

File Missing. In the packaging process of Java Web applications, a file may be improperly dropped from the required composition, which results in failures to invoke the correct system response and may eventually cause service malfunction, confronting the user with strange Web pages. Here, we simulate five such failures by dropping different JSP files from the PetStore application to mimic operators' mistakes during system maintenance.
Busy Loop. The actual causes of request slowdown can be numerous, such as a spinlock fault among synchronized threads. We simulate the phenomenon of slowdown by adding a busy-loop procedure to the code. Depending on the number of loops in the instrumentation, the significance of the simulated fault differs. In this section, we simulate five different busy-loop failures by allocating 30, 65, 100, 150, and 300 loops in the ShoppingCartLocalEJB of the PetStore application, respectively.
Expected Exception. An expected exception [9] happens when a method declaring exceptions (which appear in the method's signature) is invoked. In this situation, an exception is thrown without the method's code being executed. As a consequence, the user may encounter strange Web pages. We inject this fault into two different EJBs of PetStore, CatalogEJB and AddressEJB, to generate two expected-exception failures.
Null Call. The null call fault [9] causes all methods in the affected component to return a null value without executing the method's code. It is usually caused by errors in allocating system resources, failed lookups, and so on. Similar to the expected exception, the null-call failure results in strange Web pages. We inject this fault into two different EJBs of PetStore, CatalogEJB and AddressEJB, to generate two failure cases.

As a result, 17 failure cases altogether are simulated from the five different types. Note that the system is restarted before every failure injection in order to remove the impact of previously injected failures. In addition, the workloads are dynamically generated with much randomness, so that we never get a similar workload twice in the experiments. We randomly collect a certain number of measurements from each failure case and in total obtain 425 abnormal measurements. We also collect 575 normal samples, so the test data set contains 1,000 samples.
The linear and nonlinear models are compared in representing the training data and detecting the failure samples in the test data. Figs. 12a and 12b present the scatterplots of the Hotelling T2 and SPE of the test data produced by the linear and nonlinear models, respectively. The values of the normal test data are marked as "." in the figures, and those of the failure data are marked as . For the linear model, shown in Fig. 12a, there is an overlap between the distribution of the normal data and that of the failure samples. In addition, there are four normal points with very large SPE values (at around 120). We check those points in Fig. 12b, provided by the nonlinear method, and find that, among those four points, three are located in the cluster of normal samples, and only one point, (39.5, 2.1), is hard to separate from the outliers. Compared with the linear model, the nonlinear model produces a clearer separation between normal and abnormal samples in the generated statistics.
We also notice that, for the nonlinear model, the SPE statistic plays a dominant role in detecting the outliers. Based on the similar observation obtained in Fig. 8d from the synthetic data, we can explain this by two factors: 1) the nonlinear method correctly identifies the underlying data structure, and 2) in the experiment, most failure points are located outside the discovered manifold. In spite of the importance of the SPE, the T2 statistic is also useful in failure detection, especially when the SPE values are in the ambiguity region, for example, between 27 and 37 for the nonlinear model. For the linear method, the ambiguity region of the SPE is wider, from 15 to 35, as shown in Fig. 12a, because of its linear assumption about the underlying data structure. In order to quantitatively compare the performance of the two detectors, we use the method described in Section 4.2 to build the joint density of T2 and SPE from the values computed on the training data and calculate the Hellinger score (24) for every test sample. Based on these scores, the ROC curves of the two models are plotted in Fig. 13. It shows that both the linear and nonlinear methods obtain acceptable results in detecting the failure samples, due to the moderate nonlinearity of the data generated in the experiment. However, the nonlinear model produces more accurate results than the linear model.
A further investigation of the nonlinear model reveals more of its advantages over the linear method. We find that the values of T2 and SPE calculated by the nonlinear model can also provide useful clues about the significance
Fig. 12. The scatterplot of SPE and T2 of the test data produced by (a) the linear model and (b) the nonlinear model. The normal test data are marked as "." and the failure data are marked as .
of injected failures. Fig. 14 uses the SPE values for the busy-loop failures to demonstrate this fact. Fig. 14b shows the histogram of the SPE values of the normal test data generated by the nonlinear model. Figs. 14d, 14f, and 14h present the SPE of test samples from three busy-loop failure cases with different impacts, in which 30, 65, and 100 busy loops are injected into an EJB component, respectively. The results show that the SPE values for these failures are well separated, and the failure with stronger significance produces larger SPE values. The SPEs computed by the linear model are also shown, in Figs. 14a, 14c, 14e, and 14g. Compared with those from the nonlinear model, the SPE values from the linear model overlap and lack strong evidence about the significance of the injected failures.
Our failure localization procedure also produces satisfactory results. Here, we use the variable ranking from the SPE to demonstrate this. We randomly select 200 samples from the failure measurements whose SPE values are affected. Among the 200 selected data, each type of failure occupies 40 samples, which are continuously indexed. We apply the attribute ranking method described in Section 5 and output a vector π_i, which contains the indices of the five most suspicious attributes. In total, we generate 200 such vectors π_i, i = 1, …, 200. In order to see whether these attribute ranking results really tell any evidence about the injected failure, we perform hierarchical clustering on the π_i's, in which the Jaccard coefficient

$$J(\pi_i, \pi_j) = \frac{|\pi_i \cap \pi_j|}{|\pi_i \cup \pi_j|}$$

is used for calculating the similarity between π_i and π_j. We start by assigning each vector π_i to its own cluster, and then we merge the clusters with the largest average similarity until five clusters are obtained. The results show that the vectors π_i from the same type of failure provide consistent variable ranking results.
Table 1 lists the cluster indices associated with each failure measurement. We see that all the vectors belonging to the memory leaking (indices 1-40) and file missing (41-80) failures form separate clusters. Most of the vectors from the busy-loop failure (81-120) form one cluster, except for five noisy points. However, it is hard to separate the null call (121-160) and expected exception (161-200) failures. This is actually reasonable, since these two types of failures are generated by similar mechanisms. Therefore, our proposed failure localization method provides consistent and useful evidence about the failure. It is our future work to look further into the clustered vectors and reveal the signature of each type of failure based on its suspicious attributes. By doing so, we can quickly identify and resolve recurrent system failures by retrieving similar signatures from historic failures.
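The clustering step above, Jaccard similarity with average-linkage merging, can be sketched on toy attribute-index vectors (the vectors and the target of two clusters are hypothetical):

```python
def jaccard(a, b):
    # J(a, b) = |a & b| / |a | b| over sets of suspicious-attribute indices
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# hypothetical top-5 suspicious-attribute vectors from two failure types
vectors = [{1, 2, 3, 4, 5}, {1, 2, 3, 4, 6}, {7, 8, 9, 10, 11}, {7, 8, 9, 10, 12}]

# average-linkage agglomerative clustering down to two clusters
clusters = [[i] for i in range(len(vectors))]
while len(clusters) > 2:
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            sim = sum(jaccard(vectors[a], vectors[b])
                      for a in clusters[i] for b in clusters[j])
            sim /= len(clusters[i]) * len(clusters[j])
            if best is None or sim > best[0]:
                best = (sim, i, j)
    _, i, j = best
    clusters[i] += clusters[j]
    del clusters[j]

print(sorted(sorted(c) for c in clusters))  # [[0, 1], [2, 3]]
```

Vectors that share most of their suspicious attributes merge first, so failures of the same type end up in the same cluster, which is the behavior Table 1 reports.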
7 CONCLUSIONS
This paper has presented a method for monitoring high-dimensional data in information systems, based on the observation that high-dimensional measurements are usually located in a low-dimensional structure embedded in the original space. We have developed both linear and nonlinear algorithms to discover the underlying low-dimensional structure of the data. Two statistics, the Hotelling T2 and the SPE, have been used to represent the data variations within and outside the revealed structure. Based on the probabilistic density of these statistics, we have successfully detected a variety of simulated failures in a J2EE-based Web application. In addition, we have discovered a list of suspicious attributes for each detected failure, which are helpful in finding the failure root cause.
Fig. 14. (a) and (b) The histograms of SPE of normal test samples produced by the linear and nonlinear models. (c) and (d) The histograms of SPE by the linear and nonlinear models for samples from the busy-loop failure with 30 loops injected. (e) and (f) The histograms of SPE by the linear and nonlinear models for samples from the busy-loop failure with 65 loops injected. (g) and (h) The histograms of SPE by the linear and nonlinear models for samples from the busy-loop failure with 100 loops injected.
TABLE 1
The Clustering Results of Failure Measurements Based on the Outcomes of Attribute Ranking
Fig. 13. The ROC curves for failure detectors based on the linear and
nonlinear models.
REFERENCES

[1] C.C. Aggarwal and P.S. Yu, "Outlier Detection for High-Dimensional Data," Proc. ACM SIGMOD '01, pp. 37-46, 2001.
[2] M.K. Aguilera, W. Chen, and S. Toueg, "Using the Heartbeat Failure Detector for Quiescent Reliable Communication and Consensus in Partitionable Networks," Theoretical Computer Science, special issue on distributed algorithms, vol. 220, pp. 3-30, 1999.
[3] T.W. Anderson, An Introduction to Multivariate Statistical Analysis, second ed. Wiley, 1984.
[4] M. Balasubramanian and E.L. Schwartz, "The Isomap Algorithm and Topological Stability," Science, vol. 295, p. 7, 2002.
[5] P. Barham, R. Isaacs, R. Mortier, and D. Narayanan, "Magpie: Real-Time Modeling and Performance-Aware Systems," Proc. Ninth Workshop Hot Topics in Operating Systems (HotOS '03), May 2003.
[6] P. Bodik et al., "Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization," Proc. Second Int'l Conf. Autonomic Computing (ICAC '05), pp. 89-100, June 2005.
[7] M. Brand, "Charting a Manifold," Advances in Neural Information Processing Systems 15, MIT Press, 2003.
[8] T. Brotherton and T. Johnson, "Anomaly Detection for Advanced Military Aircraft Using Neural Networks," Proc. IEEE Aerospace Conf., pp. 3113-3123, 2001.
[9] M. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer, "Pinpoint: Problem Determination in Large, Dynamic Systems," Proc. Int'l Performance and Dependability Symp. (IPDS '02), June 2002.
[10] HP OpenView, HP Corp., http://www.openview.hp.com/, 2007.
[11] M.J. Desforges, P.J. Jacob, and J.E. Cooper, "Applications of Probability Density Estimation to the Detection of Abnormal Conditions in Engineering," Proc. Inst. of Mechanical Eng., Part C: J. Mechanical Eng. Science, vol. 212, pp. 687-703, 1998.
[12] C. Eckart and G. Young, "The Approximation of One Matrix by Another of Lower Rank," Psychometrika, vol. 1, pp. 211-218, 1936.
[13] R.D. Fierro, G.H. Golub, P.C. Hansen, and D.P. O'Leary, "Regularization by Truncated Total Least Squares," SIAM J. Scientific Computing, vol. 18, pp. 1223-1241, 1997.
[14] W. Fuller, Measurement Error Models. John Wiley & Sons, 1987.
[15] A. Gionis, P. Indyk, and R. Motwani, "Similarity Search in High Dimensions via Hashing," Proc. 25th Int'l Conf. Very Large Data Bases (VLDB '99), pp. 518-529, 1999.
[16] G.H. Golub and C.F. Van Loan, Matrix Computations, third ed. Johns Hopkins Univ. Press, 1996.
[17] P. Grassberger and I. Procaccia, "Measuring the Strangeness of Strange Attractors," Physica D, vol. 9, pp. 189-208, 1983.
[18] A. Höskuldsson, "PLS Regression Methods," J. Chemometrics, vol. 2, no. 3, pp. 211-228, 1988.
[19] Tivoli Business Systems Manager, IBM, http://www.tivoli.com/, 2007.
[20] T. Ide and H. Kashima, "Eigenspace-Based Anomaly Detection in Computer Systems," Proc. ACM SIGKDD '04, pp. 440-449, Aug. 2004.
[21] G. Jiang, H. Chen, C. Ungureanu, and K. Yoshihira, "Multi-Resolution Abnormal Trace Detection Using Varied-Length n-Grams and Automata," Proc. Second Int'l Conf. Autonomic Computing (ICAC '05), pp. 111-122, June 2005.
[22] G. Jiang, H. Chen, and K. Yoshihira, "Discovering Likely Invariants of Distributed Transaction Systems for Autonomic System Management," Proc. Third Int'l Conf. Autonomic Computing (ICAC '06), pp. 199-208, June 2006.
[23] I.T. Jolliffe, Principal Component Analysis. Springer-Verlag, 1986.
[24] T. Kourti and J.F. MacGregor, "Recent Developments in Multivariate SPC Methods for Monitoring and Diagnosing Process and Product Performance," J. Quality Technology, vol. 28, no. 4, pp. 409-428, 1996.
[25] R. Kozma, M. Kitamura, M. Sakuma, and Y. Yokoyama, "Anomaly Detection by Neural Network Models and Statistical Time Series Analysis," Proc. IEEE World Congress on Computational Intelligence '94, pp. 3207-3210, 1994.
[26] K. Yamanishi, J. Takeuchi, G. Williams, and P. Milne, "On-Line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms," Proc. Sixth ACM SIGKDD '00, pp. 320-344, 2000.
[27] M. Markou and S. Singh, "Novelty Detection: A Review, Part 1: Statistical Approaches," Signal Processing, vol. 83, pp. 2481-2497, 2003.
[28] M. Markou and S. Singh, "Novelty Detection: A Review, Part 2: Neural Network Based Approaches," Signal Processing, vol. 83, pp. 2499-2521, 2003.
[29] L. Mirsky, "Symmetric Gauge Functions and Unitarily Invariant Norms," Quarterly J. Math. Oxford, vol. 11, pp. 50-59, 1960.
[30] M.J. Piovoso, K.A. Kosanovich, and J.P. Yuk, "Process Data Chemometrics," IEEE Trans. Instrumentation and Measurement, vol. 41, no. 2, pp. 262-268, 1992.
[31] R.A. Redner and H.F. Walker, "Mixture Densities, Maximum Likelihood and the EM Algorithm," SIAM Rev., vol. 26, pp. 195-239, 1984.
[32] S. Roweis and L. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, vol. 290, pp. 2323-2326, 2000.
[33] N.K. Shah and P.J. Gemperline, "Combination of the Mahalanobis Distance and Residual Variance Pattern Recognition Techniques for Classification of Near-Infrared Reflectance Spectra," J. Am. Chemical Soc., vol. 62, no. 5, pp. 465-470, 1990.
[34] D.M.J. Tax and R.P.W. Duin, "Support Vector Domain Description," Pattern Recognition Letters, vol. 20, pp. 1191-1199, 1999.
[35] J.B. Tenenbaum, V. de Silva, and J.C. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, pp. 2319-2323, 2000.
[36] S. Van Huffel and J. Vandewalle, The Total Least Squares Problem: Computational Aspects and Analysis. Soc. for Industrial and Applied Math., 1991.
Haifeng Chen received the BEng and MEng degrees in automation from Southeast University, China, in 1994 and 1997, respectively, and the PhD degree in computer engineering from Rutgers University, New Jersey, in 2004. He was a researcher at the Chinese National Research Institute of Power Automation. He is currently a research staff member at NEC Laboratories America, Princeton, New Jersey. His research interests include data mining, autonomic computing, pattern recognition, and robust statistics.
Guofei Jiang received the BS and PhD degrees in electrical and computer engineering from the Beijing Institute of Technology, Beijing, in 1993 and 1998, respectively. From 1998 to 2000, he was a postdoctoral fellow in computer engineering at Dartmouth College, New Hampshire. He is currently a senior research staff member with the Robust and Secure Systems Group, NEC Laboratories America, Princeton, New Jersey. His current research interests include distributed systems, dependable and secure computing, and system and information theory. He has published nearly 50 technical papers in these areas. He is an associate editor for IEEE Security and Privacy and has served on the program committees of many prestigious conferences.
Kenji Yoshihira received the BE degree in electrical engineering from the University of Tokyo in 1996 and the MS degree in computer science from New York University in 2004. For five years, he was with Hitachi, where he designed processor chips for enterprise computers. Until 2002, he was the chief technical officer (CTO) at Investoria Inc., Japan, where he developed an Internet service system for financial information distribution. He is currently a research staff member with the Robust and Secure Systems Group, NEC Laboratories America, Inc., New Jersey. His current research interests include distributed systems and autonomic computing.
For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.