Anomaly in Manet2

Embed Size (px)

Citation preview

  • 8/10/2019 Anomaly in Manet2

    1/14

    IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 51, NO. 8, AUGUST 2003 2191

    Anomaly Detection in IP NetworksMarina Thottan and Chuanyi Ji

    Abstract Network anomaly detection is a vibrant researcharea. Researchers have approached this problem using varioustechniques such as artificial intelligence, machine learning, andstate machine modeling. In this paper, we first review theseanomaly detection methods and then describe in detail a statisticalsignal processing technique based on abrupt change detection. Weshow that this signal processing technique is effective at detectingseveral network anomalies. Case studies from real network datathat demonstrate the power of the signal processing approachto network anomaly detection are presented. The application of signal processing techniques to this area is still in its infancy, andwe believe that it has great potential to enhance the field, andthereby improve the reliability of IP networks.

    Index Terms Adaptive signal processing, autoregressive pro-cesses, eigenvalues and eigenfunctions, network performance, net-

    work reliability.

    I. INTRODUCTION

    N ETWORKS arecomplex interacting systems andarecom-prised of several individual entities such as routers andswitches. The behavior of the individual entities contribute tothe ensemble behavior of the network. The evolving nature of internet protocol (IP) networks makes it difficult to fully un-derstand the dynamics of the system. Internet traffic was firstshown to be composed of complex self-similar patterns by Le-land et al. [1]. Multifractal scaling was discovered and reportedby Levy-Vehel et al. [2]. To obtain a basic understanding of

    the performance and behavior of these complex networks, vastamounts of information need to be collected and processed.Often, network performance information is not directly avail-able, and the information obtained must be synthesized to ob-tain an understanding of the ensemble behavior. In this paper,we review the use of signal processing techniques to addressthe problem of measuring, analyzing, and synthesizing network information to obtain normalnetwork behavior. The normal net-work behavior thus computed is then used to detect network anomalies.

    There are two main approaches to studying or characterizingthe ensemble behavior of the network: The first is the infer-ence of the overall network behavior through the use of network probes [3] and the second by understanding the behavior of theindividual entities or nodes. In the first approach, which is often

    Manuscript received October 4, 2002; revised April 2, 2003. This work wassupported by DARPA under Contract F30602-97-C-0274 and the National Sci-ence Foundation under Grant ECS 9908578. The associate editor coordinatingthe review of this paper and approving it for publication was Dr. Rolf Riedl.

    M. Thottan is with the Department of Networking Software Research, BellLaboratories, Holmdel, NJ 07733 USA (e-mail: [email protected]).

    C. Ji is with the Department of Electrical, Computer, and Systems Engi-neering, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: [email protected]).

    Digital Object Identifier 10.1109/TSP.2003.814797

    referred to as network tomography [ 4], there is no assumptionmade about the network, and through the use of probe measure-ments, one can infer the characteristics of the network. This is auseful approach when characterizing noncooperative networksor networks that are not under direct administrative control. Inthe case of a single administrative domain where knowledge of the basic network characteristics such as topology are available,an entity-based study would provide more useful informationto the network administrator. Using some basic knowledge of the network layout as well as the traffic characteristics at theindividual nodes, it is possible to detect network anomalies andperformance bottlenecks. The detection of these events can thenbe used to trigger alarms to the network management system,

    which, in turn, trigger recovery mechanisms. The methods pre-sented in this paper deal with entity-based measurements.The approaches used to address the anomaly detection

    problem are dependent on the nature of the data that is avail-able for analysis. Network data can be obtained at multiplelevels of granularity such as end-user-based or network-based.End-user-based information refers to the transmission controlprotocol (TCP) and user datagram protocol (UDP) related datathat contains information that is specific to the end application.Network-based data pertains to the functioning of the network devices themselves and includes information gathered fromthe routers physical interfaces as well as from the routersforwarding engine. Traffic counts obtained from both types of

    data can be used to generate a time series to which statisticalsignal processing techniques can be applied [ 5], [6]. However,in some cases, only descriptive information such as the numberof open TCP connections, source-destination address pairs, andport numbers are available. In such situations, conventionalapproaches of rule-based methods would be more useful [ 7].

    The goal of this paper is to show the potential to apply signalprocessing techniques to the problem of network anomalydetection. Application of such techniques will provide betterinsight for improving existing detection tools as well as providebenchmarks to the detection schemes employed by thesetools. Rigorous statistical data analysis makes it possible toquantify network behavior and, therefore, more accuratelydescribe network anomalies. The scope of this paper is todescribe the problem of IP network anomaly detection in asingle administrative domain along with the types and sourcesof data available for analysis. Special emphasis is placed onmotivating the need for signal processing techniques to studythis problem. We describe areas in which advances in signaldetection theory have been useful. For example, there are noaccurate statistical models for normal network operation, andthis makes it difficult to characterize the statistical behavior of abnormal traffic patterns. In this paper, we present a techniquebased on abrupt change detection for addressing this challenge

    1053-587X/03$17.00 2003 IEEE

    http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/10/2019 Anomaly in Manet2

    2/14

    2192 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 51, NO. 8, AUGUST 2003

    [8]. Furthermore, there is no single variable or metric thatcaptures all aspects of normal network function. This presentsthe problem of synthesizing information from multiple metrics,each of which have widely differing statistical properties.To address this issue, we use an operator matrix to correlateinformation from individual metrics [ 5]. The paper also iden-tifies some of the challenges posed to the signal processing

    community by this new application.The paper is organized as follows: Section II describes the

    characteristics of network anomalies along withsomeexamples.The different sources from which network data can be obtainedare described in Section III. Section IV surveys the commonlyused methods for anomaly detection. In Section V, we describea statistical technique to solve the same problem. We present anevaluation criteria for the effectiveness of the signal processingtechnique and summarize its performance in detecting variousnetwork anomalies. We describe four in-depth case studies onanomalous events in real networks as well as the performanceof the statistical approach in each of these cases. Section VI de-scribes relevant issues in the application of signal processingtechniques to network data analysis. In Section VII, we discusssome of the open issues that provide interesting avenues to ex-plore for future work.

    II. NETWORK ANOMALIES

    Network anomalies typically refer to circumstances whennetwork operations deviate from normal network behavior.Network anomalies can arise due to various causes such asmalfunctioning network devices, network overload, maliciousdenial of service attacks, and network intrusions that disrupt thenormal delivery of network services. These anomalous eventswill disrupt the normal behavior of some measurable network

    data. In this paper, we present techniques that can be employedto detect such types of anomalies.

    The definition of normal network behavior for measured net-work data is dependent on several network specific factors suchas the dynamics of the network being studied in terms of trafficvolume, the type of network data available, and types of appli-cations running on the network. Accurate modeling of normalnetwork behavior is still an active field of research, especiallythe online modeling of network traffic [ 9]. There exists parsi-monious traffic models that accurately capture fractal and mul-tifractal scaling properties, such as the self-similar models in-troduced by Norros [ 10] and cascade models originally sug-gested by Crousse et al. [11]. While Crouse et al. concentrate on

    the signal processing issues and statistical matching to network traffic, Gilbert et al. provide a more network centric descriptionof cascade-scaling [ 12]. Within these parsimonious modelingframeworks, different approaches can be used to do anomalydetection.

    Todays commercially available network management sys-tems continuously monitor a setof measured indicators to detectnetwork anomalies. A human network manager observes thealarm conditions or threshold violations generated by a groupof individual indicators to determine the status of the health of the network. Such alarm conditions represent deviations fromnormal network behavior and can occur before or during ananomalous event. These deviations are often associated with

    performance degradation in the network. Based on this intu-itive model currently being used, we premise the following: Network anomalies are characterized by correlated transient changes in measured network data that occur prior to or duringan anomalousevent . The term transient changes refers to abruptchanges in the measured data that occurs in the same order asthefrequency of themeasurement interval.Theduration of these

    abrupt changes varies with the nature of the triggering anoma-lous event.

    A. Examples of Network Anomalies

    Network anomalies can be broadly classified into twocategories. The first category is related to network failuresand performance problems. Typical examples of network performance anomalies are file server failures, paging acrossthe network, broadcast storms, babbling node, and transientcongestion [ 13], [14]. For example, file server failures, such asa web server failure, could occur when there is an increase inthe number of ftp requests to that server. Network paging errorsoccur when an application program outgrows the memorylimitations of the work station and begins paging to a network file server. This anomaly may not affect the individual userbut affects other users on the network by causing a shortage of network bandwidth. Broadcast storms refer to situations wherebroadcast packets are heavily used to the point of disabling thenetwork. A babbling node is a situation where a node sendsout small packets in an infinite loop in order to check for someinformation such as status reports. Congestion at short timescales occurs due to hot spots in the network that may be aresult of some link failure or excessive traffic load at that pointin the network. In some instances, software problems can alsomanifest themselves as network anomalies, such as a protocolimplementation error that triggers increased or decreased trafficload characteristics. For example, an accept protocol error in asuper server ( inetd ) results in reduced access to the network,which, in turn, affects network traffic loads.

    The second major category of network anomalies is secu-rity-related problems. Denial of service attacks and network intrusions are examples of such anomalies. Denial of serviceattacks occur when the services offered by a network arehijacked by some malicious entity. The offending party coulddisable a vital service such as domain name server (DNS)lookups and cause a virtual shutdown of the network [ 15], [16].For this event, the anomaly may be characterized by very lowthroughput. In case of network intrusions, the malicious entitycould hijack network bandwidth by flooding the network withunnecessary traffic, thus starving other legitimate users [ 6],[17]. This anomaly would result in heavy traffic volumes.

    III. SOURCES OF NETWORK DATA

    Obtaining the right type of network performance data is es-sential for anomaly detection. The types of anomalies that canbe detected are dependent on the nature of the network data. Inthis section, we review some possible sources of network dataalong with their relevance for detecting network anomalies. Forthe purpose of anomaly detection, we must characterize normaltraffic behavior. The more accurately the traffic behavior can bemodeled, the better the anomaly detection scheme will perform.

    http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/10/2019 Anomaly in Manet2

    3/14

    THOTTAN AND JI: ANOMALY DETECTION IN IP NETWORKS 2193

    A. Network Probes

    Network probes are specialized tools such as ping and tracer-oute [18] that can be used to obtain specific network parameterssuch as end-to-end delay and packet loss. Probing tools providean instantaneous measure of network behavior. These methodsdo not require the cooperation of the network service provider.However, it is possible that the service providers could choosenot to allow ping traffic through their firewall. Furthermore, thespecialized IP packets used by these tools need not follow thesame trajectory or receive the same treatment by network de-vices as do the regular IP packets. This method also assumes theexistence of symmetric paths between given source-destinationpairs. On the Internet, this assumption cannot be guaranteed.Thus, performance metrics derived from such tools can provideonly a coarse grained view of the network. Therefore, the dataobtained from probing mechanisms may be of limited value forthe purpose of anomaly detection.

    B. Packet Filtering for Flow-Based Statistics

    In packet filtering, packet flows are sampled by capturingthe IP headers of a select set of packets at different points inthe network [ 19]. Information gathered from these IP headersis then used to provide detailed network performance informa-tion. For flow-based monitoring, a flow is identified by source-destination addresses and source-destination port numbers. Thepacket filtering approach requires sophisticated network sam-pling techniques as well as specialized hardware at the network devices to do IP packet lookup. Data obtained from this methodcould be used to detect anomalous network flows. However, thehardware requirements required for this measurement methodmakes it difficult to use in practice.

    C. Data From Routing Protocols

    Information about network events can be obtained throughthe use of routing peers. For example by using an open shortestpath first (OSPF) peer, it is possible to gather all routing tableupdates that are sent by the routers [ 20]. The data collected canbe used to build the network topology and provides link statusupdates. If the routers run OSPF with traffic engineering (TE)extensions, it is possible to obtain link utilization levels [ 21].Since routing updates occur at frequent intervals, any change inlink utilization will be updated in near real time. However, sincerouting updates must be kept small, only limited information

    pertaining to link statistics can be propagated through routingupdates.

    D. Data From Network Management Protocols

    Network management protocols provide information aboutnetwork traffic statistics. These protocols support variables thatcorrespond to traffic counts at the device level. This informationfrom thenetwork devices canbe passively monitored. The infor-mation obtained may not directly provide a traffic performancemetric but could be used to characterize network behavior and,therefore, can be used for network anomaly detection. Usingthis type of information requires the cooperation of the service

    providers network management software. However, these pro-tocols provide a wealth of information that is available at veryfine granularity. The following subsection willdescribe this datasource in greater detail.

    1) Simple Network Management Protocol (SNMP): SNMPworks in a client-server paradigm [ 22]. The protocol provides amechanism to communicate between themanager and theagent.A singleSNMP manager canmonitor hundreds of SNMP agentsthat are located on the network devices. SNMP is implementedat the application layer and runs over the UDP. The SNMP man-ager has the ability to collect management data that is providedby the SNMP agent but does not have the ability to process thisdata. The SNMP server maintains a database of managementvariables called the management information base (MIB) vari-ables [ 23]. These variables contain information pertaining to thedifferent functions performed by the network devices. Althoughthis is a valuable resource for network management, we are onlybeginning to understand how this information can be used inproblems such as failure and anomaly detection.

    Every network device has a set of MIB variables that are spe-cific to its functionality. MIB variables are defined based onthe type of device as well as on the protocol level at whichit operates. For example, bridges that are data link-layer de-vices contain variables that measure link-level traffic informa-tion. Routers that are network-layer devices contain variablesthat provide network-layer information. The advantage of usingSNMP is that it is a widely deployed protocol and has beenstandardized for all different network devices. Due to the fine-grained data available from SNMP, it is an ideal data source fornetwork anomaly detection.

    2) SNMPMIB Variables: The MIB variables [ 24] fall intothe following groups: system, interfaces ( if ), address translation(at ), internet protocol ( ip), internet control message protocol(icmp ), transmission control protocol ( tcp ), user datagram pro-tocol ( udp ), exteriorgateway protocol ( egp ), andsimple network management protocol ( snmp ). Each group of variables describesthe functionality of a specific protocol of the network device.Depending on the type of node monitored, an appropriate groupof variables can be considered. If the node being monitored isa router, then the ip group of variables are investigated. The ipvariables describe the traffic characteristics at the network layer.MIB variables are implemented as counters.Time seriesdata foreach MIBvariable is obtained by differencing theMIB variablesat two subsequent time instances called the polling interval.

    There is no single MIB variable that is capable of capturingall network anomalies or all manifestations of the same network anomaly. Therefore, the choice of MIB variables depends onthe perspective from which the anomalies are detected. For ex-ample, in the case of a router, the ip group of MIB is chosen,whereas for a bridge, the if group is used.

    IV. ANOMALY DETECTION METHODS

    In this section, we review the most commonly used net-work anomaly detection methods. The methods described arerule-based approaches, finite state machine models, patternmatching, and statistical analysis.

    http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/10/2019 Anomaly in Manet2

    4/14

  • 8/10/2019 Anomaly in Manet2

    5/14

    THOTTAN AND JI: ANOMALY DETECTION IN IP NETWORKS 2195

    Fig. 1. Schematic representation of the proposed model for network anomalies.

    sequential change point detection approach. The Flooding Detection System , which was proposed by the authors in [ 6],uses measured network data that describes TCP operations todetect SYN 1 flooding attacks. SYN flooding attacks capitalizeon the limitation that TCP servers maintain all half-openconnections. Once the queue limit is reached, future TCPconnection requests are denied. The sequential change pointdetection employed here makes use of the nonparametriccumulative sum (CUSUM) method. Using this approach ontrace-driven simulations, it has been shown that SYN floodingattacks can be detected with high accuracy and reasonablyshort detection times.

    When detecting anomalies due to failures, we are confrontedwith the problem of detecting a host of potential failure sce-narios. Each of these failure scenarios differ in their manifes-tations as well as their characteristics. Thus, it is necessary toobtain a rich set of network information that could cover a widevariety of network operations. The primary source for such indepth information is in the SNMP MIBdata. Designing a failuredetection system using MIB data necessitates the use of a gen-eral method since MIB variables exhibit varying statistical char-acteristics [ 5]. Furthermore, there is no accurate fault modelavailable. The following section describes such a general failuredetection approach from the perspective of a router.

    V. ANOMALY DETECTION USING STATISTICALANALYSIS OF SNMP MIB

    MIB variables provide information that is specific to the in-dividual network devices. Since this work is on detecting net-work anomalies at the resolution of the device-level, this datasource is sufficient. Furthermore, the widespread deploymentand standardization of SNMP makes this data readily availableon network devices. In this section, a statistical analysis methodwe developed using the theory of change detection is discussedin greater detail, along with its advantages. We also provide adetection theory-based performance criteria to evaluate the ef-fectiveness of our approach. We present case studies of fourdifferent performance anomalies and provide some intuitive ex-planation on the usefulness of signal processing techniques todetect such network anomalies.

    1SYN means packets used to synchronize sequence numbers to initiate a con-nenction.

    A. Characterization of Network Anomalies

    In statistical analysis, a network anomaly is modeled as cor-related abrupt changes in network data. An abrupt change is de-fined as any change in the parameters of a time series that occurson the order of the sampling period of the measurement. For ex-ample, when the sampling period is 15 s, an abrupt change is de-fined as a change that occurs in the period of approximately 15 s.This approach models the intuition of network managers whouse hard threshold violations to generate alarms. The schemeused by most commercial management tools is similar to ma- jority voting. However, statistical signal processing techniquesreduce the number of false alarms as well as increase the proba-bility of detection as compared with simple majority voting andhard thresholds [ 5].

    Abrupt changes in time series data can be modeled using anauto-regressive (AR) process [ 8]. The assumption here is thatabrupt changes are correlated in time, yet are short-range de-

    pendent. In our approach, we use an AR process of orderto model the data in a 5-min window. Intuitively, in the eventof an anomaly, these abrupt changes should propagate throughthe network, and they can be traced as correlated events amongthe different MIB variables. This correlation property helps dis-tinguish the abrupt changes intrinsic to anomalous situationsfrom the random changes of the variables that are related tothe networks normal function. Therefore, we propose that net-work anomalies can be defined by their effect on network trafficas follows: Network anomalies are characterized by traffic-re-lated MIB variables undergoing abrupt changes in a correlated fashion . A pictorial representation of this is provided in Fig. 1.

    Using the above model for network anomalies, the anomaly

    detection problem can be posed as follows:Given a sequence of traffic-related MIB variables sampled at a fixed interval, generate a network health function that canbe used to declare alarms corresponding to anomalous network events .

    To detectanomalies from theperspective of a router, we focuson the ip layer. Three MIB variables are chosen from the ip MIBgroup. These variables represent a cross section of the trafficseen at the router. The variable ipIR (which stands for In Re-ceives), represents the total number of datagrams received fromall the interfaces of the router. ipIDe (which stands for In De-livers), represents the number of datagrams correctly deliveredto thehigher layersas this node was their finaldestination. ipOR

    http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/10/2019 Anomaly in Manet2

    6/14

    2196 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 51, NO. 8, AUGUST 2003

    Fig. 2. Contiguous piecewise stationary windows Learning window Test window.

    (which stands for Out Requests) represents the number of data-grams passed on from the higher layers of the node to be for-warded by the ip layer. The traffic associated with the ipIDeand ipOR variables comprise only a fraction of the entire net-work traffic. However, in the event of an anomaly, these arerelevant variables since the router does some route table pro-

    cessing, which would be reflected in these variables. The threeMIB variables chosen are not strictly independent. The averagecross correlation of ipIR with ipIDe is 0.08 and with ipOR is0.05. The average cross correlation between ipOR and ipIDe is0.32.

    B. Abrupt Change Detection

    In our statistical approach, the network health function is ob-tained using a combination of abnormality indicators from theindividual MIB variables. The abnormality in the MIB data isdetermined by detecting abrupt changes in their statistics. Sincethe statistical distribution of the individual MIB variables aresignificantly different it is difficult to do joint processing of these variables. Therefore, the abrupt changes in each of theMIBvariables is firstobtained.Changedetection is done using ahypothesis test based on the generalized likelihood ratio (GLR)[36]. This test provides an abnormality indicator that is scaledbetween 0 and 1.

    Abrupt changes are detected by comparing the variance of the residuals obtained from two adjacent windows of data thatare referred to as the learning and test windows,as shown in Fig. 2. Residuals are obtained by imposing an ARmodel on the time series data in each of the windows. The like-lihood ratio for a single variable is obtained as shown in (1)

    (1)

    where and are the variance of the residual in the learningwindow and the test window, respectively, ,where is the order of the AR process, and is the lengthof the learning window. Similarly, , where isthe length of the test window. is the pooled variance of thelearning and test windows. The abnormality indicators thus ob-tained from the individual MIB variables are collected to forman abnormality vector . The abnormality vector is ameasure of the abrupt changes in normal network behavior.

    C. Combining the Abnormality Vectors:

    The individual abnormality vectors must be combined to pro-vide a health function. This network health function is obtainedby incorporating the spatial dependencies between the abruptchanges in the individual MIB variables. This is accomplishedusing a linear operator . Such operators are frequently used in

    quantum mechanics [ 37]. The linear operator is designed basedon the correlation between the chosen MIB variables. In partic-ular, the quadratic functional

    (2)

    is used to generate a continuous scalar indicator of network health. This network health indicator is interpreted as a mea-sure of abnormality in the network, as perceived by the specificnode. The network health indicator is bounded between 0 and1 by an appropriate transformation of the operator [ 5]. In thenetwork health function, a value of 0 represents a healthy net-work, and a value of 1 represents maximum abnormality in the

    network.The operator matrix is an matrix ( is the number

    of MIB variables). In order to ensure orthogonal eigenvectorsthat form a basis for and real eigenvalues, the matrix isdesigned to be symmetric. Thus, it has orthogonal eigenvec-tors with real eigenvalues. A subset of these eigenvectors canbe identified to correspond to anomalous states in the network [5]. If and are the minimum and maximum eigenvaluesthat correspond to these anomalous states, the problem of de-tecting network anomalies can then be expressed as

    (3)

    where is the earliest time at which the functionalexceeds . By virtue of the design of the operator matrix (dis-cussed below), the function has an upper bound

    (4)

    Each time the condition expressed in (3) is satisfied, we have adeclaration of an anomalous condition.

    Design of the Operator Matrix : The primary goal of theoperator matrix is to incorporate the correlation between theindividual components of the abnormality vector. The abnor-mality vector is a vector with components

    (5)

    http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/10/2019 Anomaly in Manet2

    7/14

    THOTTAN AND JI: ANOMALY DETECTION IN IP NETWORKS 2197

    where each component of this vector corresponds to the abnor-mality associated with the individual MIB variables, as obtainedfrom the likelihood ratio test. In order to complete the basis setso that all possible states of the system are included, an addi-tional component that corresponds to the normal func-tioning of the network is added. The final component allows forproper normalization of the input vector. The new input vector

    (6)

    is normalized with as the normalization constant( ). By normalizing the input vectors theexpectation of the observable of the operator can beconstrained to lie between 0 and 1 [see (18)].

    The appropriate operator matrix will therefore be. We design the operator matrix to be Hermitian

    in order to have an eigenvector basis. Taking the normal stateto be uncoupled to the abnormal states, we get a block diagonalmatrix with an upper block and a 1 1 lower

    block, as in (7), shown at the bottom of the page.Theelementsof theupperblock of theoperatormatrix

    are obtained as follows: When , we have

    (8)

    (9)

    which is the the ensemble average of the two point spatial cross-correlation of the abnormality vectors estimated over a time in-terval [38]. For , we have

    (10)

    Using this transformation ensures that the maximum eigenvalueof the matrix is 1.

    The element indicates the contribution of thehealthy state to the indicator of abnormality for the network node. Since the healthy state should not contribute to the abnor-mality indicator, the component is assigned as ,which, in the limit, tends to 0. Therefore, for the purpose of de-tecting faults, we only consider the upper block of the matrix

    .The entries of thematrix describe how the operator causes the

    components of the input abnormality vector to mix with each

    other. The matrix is symmetric, real, and the elements

    are non-negative and, hence, the solution to the characteristicequation

    (11)

    consists of orthogonal eigenvectors with eigenvalues. The eigenvectors obtained are normalized to form an

    orthonormal basis set. The first components of cantherefore be decomposed as a linear combination of the eigen-vectors of , namely,

    (12)

    The th component of is present only for normaliza-tion purposes and is therefore omitted in the subsequent calcu-lation of network abnormality. Incorporating the spatial depen-dencies through the operator transforms the abnormality vector

    as

    (13)

    Here, measures the degree to which a given abnormalityvector falls along the th eigenvector. This value can beinterpreted as a probability amplitude and as the probabilityof being in the th eigenstate.

    A subset of the eigenvectors , where ,is called the fault vectorset and can be used to define a faulty re-gion. The fault vectors are chosen based on the magnitude of thecomponents of the eigenvector. The eigenvector that has com-ponents proportional to (since it is normalized) is identi-

    fied as the most faulty vector since it corresponds to maximumabnormality in all its components. Furthermore, based on ourfault model of correlated abrupt changes, the eigenvector pro-portional to the vector signifies the maximum correlationbetween all the abnormality indicators.

    If a given input abnormality vector can be completely ex-pressed as a linear combination of the fault vectors

    (14)

    then we say that the abnormality vector falls in the fault domain.The extent to which any given abnormality vector lies in the

    fault domain can be obtained in the following manner: Since

    (7)

    http://-/?-http://-/?-
  • 8/10/2019 Anomaly in Manet2

    8/14

    2198 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 51, NO. 8, AUGUST 2003

    any general abnormality vector is normalized, we have thefollowing condition:

    (15)

    As there are different valuesfor , an average scalar measure

    of the transformation in the input abnormality vector is obtainedby using the quadratic functional

    (16)

    Using (16) and the fact that the Kronecker delta,we have

    (17)

    (18)

    The measure is the indicator of the average abnormalityin the network as perceived by the node.

    Now, consider an input abnormality vector that lies com-pletely in the fault domain. Using the condition in (15), weobtain a bound for as

    (19)

    where are the eigenvalues corresponding to the set of faultvectors. Thus, using these bounds on the functional , analarm is declared when

    (20)

    Note that sincethe maximumeigenvalueof is1,. The maximum eigenvalue by design is associated with the

    most faulty eigenvector.1) Performance Evaluation of the Statistical Analysis

    Method: The performance of the statistical algorithm isexpressed in terms of the prediction time and the meantime between false alarms . Prediction time is the timeto the anomalous event from the nearest alarm preceding it.A true anomaly prediction is identified by a declaration thatis correlated with an accurate fault label from an indepen-dent source such as syslog messages and/or trouble tickets.Therefore, anomaly prediction implies two situations: a) In

    the case of predictable anomalies such as file server failuresand network access problems, true prediction is possible byobserving the abnormalities in the MIB data, and b) in the caseof unpredictable anomalies such as protocol implementationerrors, early detection is possible as compared with the existingmechanisms such as syslog messages and trouble reports. Anyanomaly declaration that did not coincide with a label wasdeclared a false alarm. The quantities used in studying theperformance of the agent are depicted in Fig. 3. is the numberof lags used to incorporate the persistence criteria in order todeclare alarms corresponding to fault situations. Persistencecriteria implies that for an anomaly to be declared, the alarmsmust persist for consecutive time lags.

    Fig. 3. Quantities used in performance analysis.

    TABLE ISUMMARY OF RESULTS OBTAINED USING THE STATISTICAL TECHNIQUE

    In some cases, alarms are obtained only after the anomaly hasoccurred. In these instances, we only detect the problem. Thetime for detection is measured as the time elapsed betweenthe occurrence of the anomaly and the declaration of the alarm.There are also some instances where alarms were obtained both

    preceding and after the anomaly. In these cases, the alarms thatfollow are attributed to the hysteresis effect of the anomaly.

    2) Application of the Statistical Approach to Network DataFrom SNMP MIB: The statistical techniques described abovehave been successfullyused to detect eight outof nine file serverfailures in the campus network and 14 out of 15 file server fail-ureson the enterprise network. Interestingly, the samealgorithmwith no modifications was able to detect all eight instances of network access problems, one protocol implementation error,and one run-away process on an enterprise network. 2 Thus, wesee that the statistical techniques are able to easily generalize todetect different types of network anomalies. Further evidenceof this is provided by using case studies from real network fail-ures. A summary of the results obtained using the statisticaltechniques is provided in Table I. It was observed that a plainmajority voting scheme on the variable level abnormality indi-cators was able to detect only file server failures and not any of the other three types of failures [ 13].

    3) Case Studies: In this section, we present examplesof net-work failures obtained from two different production networks:an enterprise network and a campus network. Both these net-works were being actively monitored and were well designed tomeet customer requirements. The types of anomalies observed

    2In the collected data set, there was only one instance each of protocol imple-mentation error and runaway process

    http://-/?-http://-/?-
  • 8/10/2019 Anomaly in Manet2

    9/14

    THOTTAN AND JI: ANOMALY DETECTION IN IP NETWORKS 2199

    Fig. 4. Case (1) file server failure. Average abnormality at the router.

    were the following: file server failures, protocol implementa-tion errors, network access problems, and runaway processes[5]. Most of these anomalous events were due to abnormal useractivity, except for protocol implementation errors. However, allof these cases did affect the normal characteristics of the MIBdata andimpaired thefunctionality of thenetwork. We show thatby using signal processing techniques, it is possible to detect thepresence of network anomalies prior to their being detected bythe existing alarm systems such as syslog 3 messages and troubletickets.

    Case Study (1): File Server Failure Due to Abnormal User Behavior: In this case study, we describe a scenario corre-sponding to a file server failure on one of the subnets of thecampus network. Twelve machines on the same subnet and 24machines outside the subnet reported the problem via syslogmessages. The duration of the failure was from 11:10 am to11:17 am (7 min) on December 5, 1995, as determined by thesyslog messages. The cause of the file server failure was anunprecedented increase in user traffic ( ftp requests) due to therelease of a new web-based software package.

    This case study represents a predictable network problemwhere the traffic related MIB variables show signs of abnor-mality before the occurrence of the file server failure. The faultwaspredicted21 minbefore theserver crash occurred. Figs. 47show the output of the statistical algorithm at the router and in

    the individual ip layer variables. The fault period is shown byvertical dotted lines. In Fig. 4, for router health, the x denotesthe alarms that correspond to input vectors that are abnormal.

    The variable level indicators capture the trends in abnor-mality. Note that there is a drop in the mean level of the trafficin the ipIR variable immediately prior to the failure. Among thethree variables considered, the variables ipOR and ipIDe arethe most well-behaved, i.e., not bursty, and the ipIR variable isbursty. Because the ipIDE and ipOR variables are less bursty,they lend easily to conventional signal processing techniques(see Figs. 6 and 7).

    3Syslog messages are system generated messages in response to some failure.

    Fig. 5. Case (1) file server failure. Abnormality indicator of ipIR .

    Fig. 6. Case (1) file server failure. Abnormality indicator of ipIDe .

    Fig. 7. Case (1) file server failure. Abnormality indicator of ipOR .

    http://-/?-http://-/?-
  • 8/10/2019 Anomaly in Manet2

    10/14

    2200 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 51, NO. 8, AUGUST 2003

    Fig. 8. Case (2) protocol error. Average abnormality at the router.

    However, the combined information from all the three vari-ables is able to capture the traffic behavior at the time of theanomaly, as shown in Fig. 4. There are very few alarms at therouter level, and the mean time between false alarms in this casewas 1032 min (approx 17 h).

    Case Study (2): Protocol Implementation Errors: This casestudy is onewhere thefault itself is notpredictable butthe symp-toms of the fault can be observed. Typically, a protocol imple-mentationerror is an undetected software problem that could gettriggered by a specific set of circumstances on the network. Onesuch fault detected on the enterprise network was that of a superserver inetd protocol error. The super server is the server thatlistens for incoming requests for various network servers, thus

    serving as a single daemon that handles all server requests fromthe clients. The existence of the fault was confirmed by syslogmessages and trouble tickets. Syslog messages reported an inetd error. In addition, other faulty daemon process messages werealso reported during this time. Presumably, these faulty daemonmessages are related to the super server protocol error. Duringthe same time interval, trouble tickets also reported problemssuch as the inability to connect to the web server, send mail,or print on the network printer, as well as difficulty in loggingonto the network. The super server protocol problem is of con-siderable interest since it affected the overall performance of thenetwork for an extended period of time.

    The prediction time of this network failure relative to the

    syslog messages was 15 min. The existing trouble ticketingscheme only responded to the fault situation and, hence,detected the failure after its onset.

    Figs. 811 show the alarms generated at the router and theabnormality indicators at the individual variables. Again, thecombined information from all the three variables captures theabnormal behavior leading up to the fault. Note that since thisproblem was a network-wide failure, the ipIR andthe ipIDe vari-ables show significant changes around the fault region. Sincethere was a distinct change in behavior in two of the three vari-ables, the combined abnormality at the router was less prone toerror. Thus, there were no false alarms reported in this data set,but rather, persistent alarms were observed just before the fault.

    Fig. 9. Case (2) protocol error. Abnormality indicator of ipIR .

    Fig. 10. Case (2) protocol error. Abnormality indicator of ipIDe .

    Fig. 11. Case (2) protocol error. Abnormality indicator of ipOR .

  • 8/10/2019 Anomaly in Manet2

    11/14

    THOTTAN AND JI: ANOMALY DETECTION IN IP NETWORKS 2201

    Fig. 12. Case (3) network access. Average abnormality at the router.

    Fig. 13. Case (3) network access. Abnormality indicator of ipIR .

    Case Study (3): Network Access Problems: Network accessproblems were reported primarily in the trouble tickets. Thesefaults were often not reported by the syslog messages. Due to theinherent reactive nature of trouble tickets, it is hard to determinethe exact time when the problem occurred. The trouble reportsreceived ranged from the network being slow to inaccessibility

    of an entire network domain. The prediction time was 6 min.The mean time between false alarms was 286 min (4 h and 46min). Figs. 1215 show the alarms obtained at the router levelas well as the abnormality indicators at the variables. Note thatthe ipIR variable shows a gradual increase in the baseline as itnears the fault region.

    Case Study (4): Runaway Processes: A runaway process isan example of high network utilization by some culprit userthat affects network availability to other users on the network.A runaway process is an example of an unpredictable fault butwhose symptoms can be used to detect an impending failure.This is a commonlyoccurring problem in mostcomputation-ori-ented network environments. Runaway processes are known to

    Fig. 14. Case (3) network access. Abnormality indicator of ipIDe .

    Fig. 15. Case (3) network access. Abnormality indicator of ipOR .

    be a security risk to the network. This fault was reported by thetrouble tickets but much after thenetwork had run out of processidentification numbers. In spite of having a large number of syslog messages generated during this period, there was no clearmessage indicating that a problem had occurred. The predictiontimewas 1 min, and the mean timebetween false alarms was 235min (about 4 h). Figs. 1619 show the performance of the statis-

    tical technique in the detection of the runaway process. In thisscenario, the ipIR variable shows a noticeable change in meanimmediately prior to the fault being detected by conventionalschemes such as syslog and trouble tickets. However, the statis-tical analysis method captures this anomalous behavior aheadof the syslog reports, as seen in the ipIR variable, as well as inthe combined router indicator.

    4) Effectiveness of Statistical Techniques: The effectivenessof the statistical approach can be seen from its ability to dis-tinguish between different failures. Once an alarm is obtained,using the behavior of the abnormality indicators 1 h prior to theanomaly time, we were able to identify thenature of theanomaly[39]. Since network anomalies typically cause deviations from

    http://-/?-http://-/?-
  • 8/10/2019 Anomaly in Manet2

    12/14

    2202 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 51, NO. 8, AUGUST 2003

    Fig. 16. Case (4) runaway process. Average abnormality at the router.

    Fig. 17. Case (4) runaway process: Abnormality indicator of ipIR .

    Fig. 18. Case (4) runaway process: Abnormality indicator of ipIDe .

    Fig. 19. Case (4) runaway process. Abnormality indicator of ipOR .

    Fig.20. Classification of faults usingthe average (over 1 h) of the abnormalityvector. x: file server failures; o: network access problems; star : protocol error;square : runaway process.

    normal behavior in the measured data prior to the actual mani-festation of the failure, it is possible to observe in advance the

    failure signatures corresponding to the impending failure. Theaverage values of the abnormality indicators computed over thisperiod are used to locate the anomaly in the problem space de-fined by the anomaly vectors. As shown in Fig. 20, the fouranomaly types are clustered in different areas of the problemspace. The Euclidean distance between the center vector of thefile server failures and the network access problems is approxi-mately 1.16. The standard deviation for the network file servercluster is 0.43, and that for the network access cluster is 0.07.These results show that the two clusters do not overlap.

    We have limited data on the other two types of anomalies, butit is interesting to note that they are distinct from both file serverfailures and network access problems. In the case of file server

  • 8/10/2019 Anomaly in Manet2

    13/14

    THOTTAN AND JI: ANOMALY DETECTION IN IP NETWORKS 2203

    failures (shown as x in Fig. 20), the abnormality in the ipORand ipIDe variables are much more significant than in ipIR . Onthe contrary, network access problems (shown as o in Fig. 20)are expressed only in the ipIR variable. The fact that these faultswere predicted or detected by the quadratic functional ,which isolates a very narrow region of the problem space, sug-gests that the abnormality in the feature vectors increases as the fault event approaches.

    D. More Related Work

    In our early work, we used duration filter heuristics to obtainreal-timealarmsforanomaly detection in conjunction with MIBvariables [ 40]. In [41] and [42], Bayesian Belief networks wereused together with MIB variables for nonreal time detection of anomalies. Signal processing and statistical approaches are alsogaining more applications in detecting anomalies related to ma-licious events [ 6] and network services such as VoIP [ 43].

    VI. DISCUSSION

    The statistical signal processing tools presented here are ver-satile in their applicability to time series data obtained fromother sources of network information such as probing andpacketfiltering techniques. In [ 5], the authors present the applicationof the methods described above to the interface layer variables.The statistical techniques using the CUSUM approach [ 6] andthe likelihood ratio test have been shown to be applicable to twosources of traffic traces and to two different network topologies,respectively. In this section, we provide a discussion on someissues relevant to using the statistical analysis methods on mea-sured network data.

    It was observed that the use of fine-grained data significantly

    improves detection times since the confidence of statistical anal-ysis techniques is only constrained by sample sizes. For ex-ample, in the work presented here, using MIB variables, a sam-pling frequency of 15 s was used. However, it would be pos-sible to obtain a finer sampling frequency if the polling entitieswere optimally located [ 44]. The primary limiting factor to in-creasing the polling frequency in the case of MIB data is thepriority given to processing SNMP packets by the router beingpolled. Independent of the load on the router, it was observedthat typically there is a 20-ms delay in poll responses.

    In terms of time synchronization, there is a requirement thatthe chosen feature variables are all of the same time granularity.To the best of our knowledge, the problem of simultaneously

    handling multiple feature variables with different time granular-ities is still an open problem. Time synchronization issues alsoarise from the difference in time stamps between the polled en-tity and the polling node. This issue can be resolved using thenetwork time protocol (NTP) MIB, which provides a uniformnetwork-wide time stamp. There is also work done on designingalgorithms to adjust for clock skew [ 45].

    The operator matrix described here can be easily applied toany other system that can be described using time series data.The number of feature variables define theeigenstates of the op-erator matrix. Therefore, if a broader set of anomalies must bedetected, then additional feature vectors must be added. Thus,the scope of the detector is primarily limited by the dimension-

    ality of theoperatormatrix. Theconstructionof theoperatorwillbe similar to the methods described in this paper.

    VII. CONCLUSION AND FUTURE DIRECTIONS

    This paper provides a review of the area of network anomalydetection. Based on the case studies presented, it is clear thatthere is a significant advantage in using the wide array of signalprocessing methods to solve the problem of anomaly detection.A greater synergybetween the networking and signal processingareas will help develop better and more effective tools for de-tecting network anomalies and performance problems. A few of the open issues in the application of statistical analysis methodsto network data are discussed below.

    The change detection approach presented in this paper makesthe assumption that the traffic variables are quasistationary.However, it was observed that some of the MIB variablesexhibit nonstationary behavior. A method to quantify the burstybehavior of the MIB variables could lead to using better modelsfor the traffic and improve the false alarm rates at the variablelevel, thus increasing the optimality of statistical methods.An accurate estimation of the Hurst parameter for the MIBvariables was difficult due to the lack of high-resolution data[46]. Often, the alarms corresponding to anomalous eventshave to check for a persistence criteria to reduce the numberof false alarms. The major reason for false alarms come fromthe abnormality indicators obtained for the bursty variablessuch as ipIR . Increasing the order of the AR model may helpin reducing the false alarm rate, but there is a tradeoff sincethe resolution in detection time would decrease. Another openissue is that not all abrupt changes in MIB data correspond tonetwork anomalies. Thus, the accurate definition of the natureof the abrupt changes corresponding to anomalous events isessential for increased detection accuracy.

    The SNMP protocol runs over the UDP transport mechanismand, therefore, could result in lost SNMP queries and responsesfrom the devices. From the signal processing perspective, thiscould result in missing samples, and it is necessary to designefficient algorithms to deal with missing data.

    From the study presented here, it is clear that signal pro-cessing techniques can add significant advantage to existingnet-work management tools. By improving the capability of pre-dicting impending network failures, it is possible to reduce net-work downtime and increase network reliability. Rigorous sta-tistical analysis can lead to better characterization of evolvingnetwork behavior and eventually lead to more efficient methodsfor both failure and intrusion detection.

    ACKNOWLEDGMENT

    The authors would like to thank D. Hollinger, N. Schimke,R. Collins, and C. Hood for their generous help with the campusdatacollection. Theyalso acknowledge Lucent Technologies forproviding data on the enterprise network and thank K. Vastolafor helpful discussions on the topic.

    REFERENCES[1] W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson, On the

    self-similar nature of ethernet traffic (extended version), IEEE/ACM Trans. Networking , vol. 2, pp. 115, Feb. 1994.

    http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/10/2019 Anomaly in Manet2

    14/14

    2204 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 51, NO. 8, AUGUST 2003

    [2] J. L. Vehel, E. Lutton, and C. Tricot, Fractals in Engineering: FromTheory to Industrial Applications . New York: Springer-Verlag, 1997,pp. 185202.

    [3] N. G. Duffield, F. L. Presti, V. Paxson, and D. Towsley, Inferring link loss using striped unicast probes, in Proc. IEEE INFOCOM , 2001.

    [4] M. Coates, A. Hero, R. Nowak, and B. Yu, Internet tomography, IEEE Signal Processing Mag. , vol. 19, pp. 4765, May 2002.

    [5] M. Thottan, Fault detection in ip networks, Ph.D. dissertation, Rens-selaer Polytech. Inst., Troy, NY, 2000. Under patent with RPI.

    [6] H. Wang, D. Zhang, and K. G. Shin, Detecting syn flooding attacks,in Proc. IEEE INFOCOM , 2002.[7] O. Akashi, T. Sugawara,K. Murakami,M. Maruyama,and N. Takahashi,

    Multiagent-based cooperative inter-as diagnosis in encore, in Proc. NOMS , 2000.

    [8] M. Basseville and I. Nikiforov, Detection of Abrupt Changes, Theoryand Application . Englewood Cliffs: Prentice-Hall, 1993.

    [9] T. Ye, S. Kalyanaraman, D. Harrison, B. Sikdar, B. Mo, H. T. Kaur, K.Vastola, and B. Szymanski, Network management and control usingcollaborative on-line simulation, in Proc. CNDSMS , 2000.

    [10] I. Norros, A storage model with self-similar input, Queueing Syst. ,vol. 16, pp. 387396, 1994.

    [11] R. H. Reidi, M. S. Crouse, V. J. Ribeiro, and R. G. Baraniuk, A multi-fractal wavelet model with application to network traffic, IEEE Trans. Inform. Theory , vol. 45, pp. 9921018, Mar. 1999.

    [12] A. C. Gilbert, W. Willinger, and A. Feldmann, Scaling analysis of con-servativecascades with applications to network traffic, IEEE Trans. In-

    form. Theory , vol. 45, pp. 971991, Mar. 1999.[13] M.Thottanand C. Ji,Using networkfaultpredictionsto enableip trafficmanagement, J. Network Syst. Manage. , 2000.

    [14] R. Maxion and F. E. Feather, A case study of ethernet anomalies in adistributed computing environment, IEEE Trans. Reliability , vol. 39,pp. 433443, Oct. 1990.

    [15] G. Vigna and R. A. Kemmerer, Netstat: A network based intrusuiondetection approach, in Proc. ACSAC , 1998.

    [16] J. Yang, P. Ning, X. S. Wang, and S. Jajodia, Cards: A distributedsystem for detecting coordinated attacks, in Proc. SEC , 2000, pp.171180.

    [17] S. Savage, D. Wetherall, A. R. Karlin, and T. Anderson, Practical net-work support for ip traceback, in Proc. ACM SIGCOMM , 2000, pp.295306.

    [18] CAIDA. Cooperative association for internet data analysis. [Online].Available: http://www/caida.org/Tools.

    [19] W. S. Cleveland, D.Lin, andD. X. Sun, Ippacket generation: Statistical

    models for tcp start times based on connection rate super position, inProc. ACM SIGMETRICS , 2000, pp. 166177.[20] R. Caceres, N. G. Duffield, A. Feldmann, J. Friedmann, A. Greenberg,

    R. Greer, T. Johnson, C. Kalmanek, B. Krishnamurthy, D. Lavelle, P.P. Mishra, K. K. Ramakrishnan, J. Rexford, F. True, and J. E. van derMerwe, Measurement and analysis of ip network usage and behavior, IEEE Commun. Mag. , pp. 144151, May 2000.

    [21] P. Aukia, M. Kodialam, P. Koppol, T. Lakshman, H. Sarin, and B. Suter,Rates: A server for mpls traffic engineering, IEEE Network Magazine ,pp. 3441, Mar./Apr. 2000.

    [22] W. Stallings, SNMP, SNMPv2, and CMIP The Practical Guide to Network Management Standards , 5th ed. Wellesley, MA: Ad-dison-Wesley, 1994.

    [23] K. McCloghrie and M. Rose, Management information base for net-work management of tcp/ip-based internets: Mib 2,, RFC1213, 1991.

    [24] M. Rose, The Simple Book: An Introduction to Internet Management ,2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 1996.

    [25] T. D. Ndousse andT.Okuda, Computational intelligencefor distributedfault management in networks using fuzzy cognitive maps, in Proc. IEEE ICC , Dallas, TX, Jun. 1996, pp. 15581562.

    [26] L. Lewis,A case based reasoningapproachto themanagement of faultsin communication networks, in Proc. IEEE INFOCOM , vol. 3, SanFrancisco, CA, Mar. 1993, pp. 14221429.

    [27] A. S. Franceschi, L. F. Kormann, and C. B. Westphall, Performanceevaluation for proactive network management, in Proc. IEEE ICC ,Dallas, TX, June 1996, pp. 2226.

    [28] L. Lewis and G. Dreo, Extending trouble ticket systems to fault diag-nosis, IEEE Network , vol. 7, pp. 4451, Nov. 1993.

    [29] I. Katzela and M. Schwarz, Schemes for fault identification in commu-nication networks, IEEE/ACM Trans. Networking , vol. 3, pp. 753764,Dec. 1995.

    [30] I. Rouvellou and G. Hart, Automatic alarm correlation for faultidentification, in Proc. IEEE INFOCOM , Boston, MA, Apr. 1995, pp.553561.

    [31] A. Bouloutas, G. Hart, and M. Schwartz, On the design of observersforfailure detectionof discrete eventsystems, in Network Management and Control . New York: Plenum, 1990.

    [32] A. Lazar, W. Wang, and R. Deng, Models and algorithms for network fault detection and identification: A review, in Proc. IEEE Int. Contr.Conf. , 1992.

    [33] G. Jakobson and M. D. Weissman, Alarm correlation, IEEE Network ,vol. 7, pp. 5259, Nov. 1993.

    [34] F. Feather and R. Maxion, Fault detection in an ethernet network using

    anomaly signature matching, in Proc. ACM SIGCOMM , vol. 23, SanFrancisco, CA, Sept. 1993, pp. 279288.[35] S. Papavassiliou, M. Pace, A. Zawadzki, and L. Ho, Implementing en-

    hanced network maintenance for transaction access services: Tools andapplications, Proc. IEEE Int. Contr. Conf. , vol. 1, pp. 211215, 2000.

    [36] H. L. V. Trees, Detection, Estimation, and Modulation Theory . NewYork: Wiley, 1971, vol. 1.

    [37] C. Cohen-Tannoudji, B. Diu, and F. Laloe, Quantum Mechanics , 2nded. New York: Wiley, 1977, vol. 1.

    [38] M. Kirby and L. Sirovich, Application of the KarhunenLoeve proce-dure for thecharacterizationof humanfaces, IEEE Trans. Pattern Anal. Machine Intell. , vol. 12, pp. 103108, Jan. 1990.

    [39] M. Thottan, Network fault classification using abnormality vectors, inProc. IFIP/IEEE Conf. Integrated Network Manage. , Seattle, WA, 2001.

    [40] M. Thottan and C. Ji, Proactive anomaly detection using distributedintelligent agents, IEEE Network , vol. 12, pp. 2127, Sept./Oct. 1998.

    [41] C. Hood and C. Ji, Proactive network fault detection, in Proc. IEEE

    INFOCOM , vol. 3, Kobe, Japan, Apr. 1997, pp. 11471155. [Online].Available: http://neuron.ecse.rpi.edu/.[42] C. Hood, Intelligent detection for fault management for communica-

    tion networks, Ph.D. dissertation, Rensselaer Polytech. Inst., Troy, NY,1996.

    [43] M. Mandjes, I. Saniee, S. Stolyar, and R. Schmidt, Load characteriza-tion, overload prediction and load anomaly detection for voice over iptraffic, in Proc. 38th Annu. Allerton Conf. Commun., Contr. Comput. ,Nov. 2000.

    [44] L. Li, M. Thottan, B. Yao, and S. Paul, Distributed network monitoringwith bounded link utilizationin ip networks, in Proc. IEEE Infocom.,2003, to be published.

    [45] L. Zhang, Z. Liu, and C. Xia, Clock synchronization algorithms fornetwork measurements, in Proc. IEEE INFOCOM , 2002.

    [46] M. Wei, Computer communication network traffic: Modeling, analysisand tests, M.S. Thesis, Rensselaer Polytechnic Inst., Troy, NY, 1996.

    Marina Thottan received the Ph.D. degree in elec-trical engineering from Rensselaer Polytechnic Insti-tute, Troy, NY, in 2000.

    She is currently a Member of Technical Staff withthe Networking Software Research Center, Bell Lab-oratories, Holmdel, NJ. Her research interests are innetwork measurements, IP QoS, time series analysis,and signal detection theory.

    Chuanyi Ji received the B.S. degree (with honors)from Tsinghua University, Beijing, China, in 1983,

    the M.S. degree from the University of Pennsylvania,Philadelphia, in 1986, and the Ph.D. degree fromthe California Institute of Technology, Pasadena, in1992, all in electrical engineering.

    She is an Associate Professor a with the De-partment of Electrical and Computer Engineering,Georgia Institute of Technology, Atlanta. She wason the faculty at Rensselaer Polytechnic Institute(RPI), Troy, NY, from 1991 to 2001. She spent her

    sabbatical at Bell LabsLucent, Murray Hill, NJ, in 1999 and was a visitingfaculty member at the Massachusetts Institute of Technology, Cambridge, inFall 2000. Her research lies in the areas of network management and controland adaptive learning systems. Her research interests are in understanding andmanaging complex networks, applications of adaptive learning systems tonetwork management, learning algorithms, statistics, and information theory.

    Dr. Ji received the CAREER award from the National Science Foundation in1995 and an Early Career award from RPI in 2000.