
    Single Pass Anomaly Detection

    M.Tech Project First Stage Report

    Submitted in partial fulfillment of the requirements

    for the degree of

    Master of Technology

    by

    Deepak Garg

    Roll No: 05329015

    under the guidance of

    Prof. Om P. Damani

    Kanwal Rekhi School of Information Technology

Indian Institute of Technology Bombay

Mumbai


    Acknowledgments

I am extremely thankful to Prof. Om P. Damani for his guidance. His constant encouragement has helped me a lot. I would like to thank Nakul Aggarwal for his constant help in modifying the ADWICE source code.

    Deepak Garg

    I. I. T. Bombay

    July 17th, 2006


    Abstract

Anomaly detection in networks is the detection of deviations from what is considered to be normal behavior, while misuse detection detects all known attack descriptions. Anomaly detection takes a learning approach to detecting failures and intrusions in a network and is intended to capture novel attacks. ADWICE [1, 2] is an efficient anomaly detection algorithm, but since it uses a distance-based clustering mechanism it suffers from inefficient clustering. We propose some additional density-based statistical variables and also propose changing the cluster shape to a box pattern, so as to improve the efficiency.

    Keywords:

    Clustering, K-means.

    ADWICE - Anomaly Detection With fast Incremental Clustering.

BIRCH - Balanced Iterative Reducing and Clustering using Hierarchies.

DBSCAN - Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.

    PHAD - Packet Header Anomaly Detection.

    IDS - Intrusion Detection System.

    1 Introduction

A network anomaly is network behavior which deviates from normal network behavior. Anomalies occur due to causes like system misconfiguration, implementation bugs, denial of service attacks, network overload, file server failures, etc.

The main detection scheme of most commercial intrusion detection systems is misuse detection, where known bad behaviors (attacks) are encoded into signatures. In anomaly detection, by contrast, the normal behavior of users or of the protected system is modeled, often using machine learning or data mining techniques rather than given signatures. During detection, new data is matched against the normality model, and deviations are marked as anomalies. Since no knowledge of attacks is needed to train the normality model, anomaly detection may detect previously unknown attacks, which misuse detection systems cannot; the detection rate depends on the power of the anomaly detection algorithm.

The training of the normality model for anomaly detection may be performed by a variety of different techniques, such as clustering-based and statistics-based approaches, and many such approaches have been evaluated. I am focusing on the clustering technique. Clustering is the method of grouping objects into meaningful subclasses so that the members of the same cluster are quite similar while the members of different clusters are quite different from each other. Clustering methods can therefore be useful for classifying log data using distance and density functions and for detecting intrusions.

This report is organized as follows: section 2 surveys related papers and reports, section 3 presents the results of ADWICE-BOX with a grid index, section 4 outlines future work, and the final section lists the references.


    2 Literature Survey

This section covers several learning algorithms and work related to the problem discussed above.

    2.1 PHAD - Packet Header Anomaly Detection

Packet header anomaly detection (PHAD)[3] builds a model trained on attack-free traffic to learn the normal range of values in each packet header field. To simplify the implementation, all fields are required to be 1, 2, 3, or 4 bytes. If a field is larger than 4 bytes (for example, the 6-byte Ethernet address), it is split into smaller fields (two 3-byte fields), and smaller fields (such as the 1-bit TCP flags) are grouped into 1-byte fields.

During training, we record all values for each field that occur at least once. However, for 4-byte fields, which can have up to 2^32 values, this is impractical for two reasons:

It requires excessive memory.

There is normally not enough training data to record all possible values, resulting in a model that overfits the data.

To solve these problems, we record a reduced set of values, either by hashing the field value modulo a constant H, or by clustering the values into C contiguous ranges.

For each field, we record the number r of anomalies that occur during the training period. An anomaly is simply any value which was not previously observed. For hashing, the maximum value of r is H. For clustering, an anomaly is any value outside all of the clusters; after observing the value, a new cluster of size 1 is formed, and if the number of clusters exceeds C, the two closest clusters are combined into a single contiguous range. Also for each field, we record the number n of times that the field was observed. For the Ethernet fields, this is the same as the number of packets. For higher-level protocols (IP, TCP, UDP, ICMP), n is the number of packets of that type. Thus, p = r/n is the estimated probability that a given field observation will be anomalous, at least during the training period.

During testing, we fix the model (n, r, and the list of observed values). When an anomaly occurs, we assign a field score of t/p, where p = r/n is the estimated probability of observing an anomaly and t is the time since the previous anomaly in the same field (either in training or earlier in testing). The idea is that events that occur rarely (large t and small p) should receive higher anomaly scores. Finally, we sum the scores of the anomalous fields (if there is more than one) to assign an anomaly score to the packet:

Packet score = sum over anomalous fields i of t_i/p_i = sum_i t_i n_i/r_i    (1)

We used a hash function to reduce the model size, to conserve memory and, more importantly, to avoid overfitting the training data. We got good performance because the important fields for intrusion detection have a small r, so that hash collisions are rare for these fields. However, hashing is a poor way to generalize continuous values such as TTL or IP packet length when the training data is not complete. A better representation would be a set of clusters, or continuous ranges. For instance, instead of listing all possible hashes of the IP packet length (0, 1, 2, ..., 999 for r = 1000), we list a set of ranges, such as 28-28, 60-1500, 65532-65535. Then if the number of clusters exceeds C, the two closest clusters are merged, where the distance between clusters is the smallest difference between two cluster elements.
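As a concrete illustration, the following is a minimal Python sketch of the per-field model described above, using C contiguous ranges; the class and method names are our own and the merge policy is simplified, so the PHAD paper [3] remains the authoritative description.

# Sketch of PHAD's per-field training model with C contiguous ranges.
class FieldModel:
    def __init__(self, C=32):
        self.C = C        # maximum number of contiguous ranges (clusters)
        self.ranges = []  # list of [lo, hi] ranges, kept sorted
        self.r = 0        # anomalies observed during training
        self.n = 0        # total observations of this field

    def _covered(self, v):
        return any(lo <= v <= hi for lo, hi in self.ranges)

    def train(self, v):
        self.n += 1
        if not self._covered(v):
            self.r += 1                    # a training-time anomaly
            self.ranges.append([v, v])     # new cluster of size 1
            self.ranges.sort()
            if len(self.ranges) > self.C:  # merge the two closest ranges
                gaps = [self.ranges[i + 1][0] - self.ranges[i][1]
                        for i in range(len(self.ranges) - 1)]
                i = gaps.index(min(gaps))
                self.ranges[i][1] = self.ranges[i + 1][1]
                del self.ranges[i + 1]

    def score(self, v, t):
        # t is the time since the previous anomaly in this field
        if self._covered(v) or self.r == 0:
            return 0.0
        return t * self.n / self.r         # t/p with p = r/n

A packet's anomaly score is then the sum of score(v, t) over its anomalous fields, exactly as in equation (1).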

Negative Points of This Algorithm

The choice of the C or H value: there is a trade-off between over-generalizing (small C or H) and overfitting the training data (large C or H).

The weights of all fields are equal. We could improve the performance of this algorithm by assigning a weight to each field; for example, the Ethernet address field should have a higher weight than the port address field.

    2.2 ADWICE - Anomaly Detection With fast Incremental Clustering

ADWICE[2] is a distance-based algorithm, and the data should be numeric. Data is therefore assumed to be transformed into numeric format by pre-processing. The algorithm requires the following input parameters:

M - the maximum number of clusters.

LS - the leaf size.

T - the initial threshold.

The threshold step for updating T.

An ADWICE model consists of a number of clusters, a number of parameters, and a tree index in which the leaves contain the clusters. The idea of ADWICE is to store only condensed information (cluster features) instead of all the data points of a cluster. A cluster feature is a triple CF = (n, S, SS), where n is the number of data points in the cluster, S is the linear sum of the n data points, and SS is the square sum of all the data points. From now on we represent clusters by their cluster features (CF). The distance between a data point and a cluster is the Euclidean distance between the data point and the centroid of the cluster, while the distance between two clusters is the Euclidean distance between their centroids. Two clusters are merged by combining their CFs. Each cluster of a leaf node must satisfy a threshold requirement (TR) with respect to the threshold value T in order to absorb a new data point[2].
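The following Python sketch shows how such a cluster feature can be maintained incrementally; the radius formula (the root-mean-square distance of the points from the centroid) follows the usual BIRCH definition, and the names are ours.

import math

# Cluster feature CF = (n, S, SS) for d-dimensional points.
class CF:
    def __init__(self, point):
        self.n = 1
        self.S = list(point)                 # linear sum of the points
        self.SS = sum(x * x for x in point)  # square sum of the points

    def centroid(self):
        return [s / self.n for s in self.S]

    def radius(self):
        # RMS distance of the points from the centroid
        c2 = sum(s * s for s in self.S) / (self.n * self.n)
        return math.sqrt(max(self.SS / self.n - c2, 0.0))

    def distance(self, point):
        # Euclidean distance from the point to the cluster centroid
        return math.dist(self.centroid(), point)

    def absorb(self, point):
        self.n += 1
        self.S = [s + x for s, x in zip(self.S, point)]
        self.SS += sum(x * x for x in point)

    def merge(self, other):
        # merging two clusters just adds the components of their CFs
        self.n += other.n
        self.S = [a + b for a, b in zip(self.S, other.S)]
        self.SS += other.SS

Because absorbing a point and merging two clusters only add up the components of the triples, the raw data points never need to be kept.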

ADWICE uses an index, a tree structure in which each non-leaf node contains one CF per child, summarizing all clusters contained in the child below. Unfortunately the original index results in a suboptimal search in which the closest cluster is not always found. Although this does not decrease processing performance, accuracy suffers: if a cluster included in the normality model is not found and the test data is normal, an index error results in an erroneous false positive and degrades detection quality. Because of this unwanted property, a new grid-based index was developed that preserves the adaptability and good performance of ADWICE.

During learning/adaptation of the normality model there are three cases in which the nodes of the grid tree need to be updated[2]:

If no cluster is close enough to absorb the data point v, then v is inserted into the model as a new cluster. If there is no leaf subspace in which the new cluster fits, a new leaf is created. However, there is no need for any additional updates of the tree, since nodes higher up do not contain any summary of the data below.


2.3 Y-means

The number-of-clusters dependency means that the value of k is critical to the clustering result. Degeneracy means that the clustering may end with some empty clusters.

Y-means[4] is a clustering algorithm for intrusion detection. It is expected to automatically partition a data set into a reasonable number of clusters so as to classify the instances into normal clusters and abnormal clusters, and it overcomes these shortcomings of the K-means algorithm.

The Y-means algorithm is similar to the K-means algorithm; we describe it here with the help of figure 1[4].

Figure 1: Y-means (figure from the Y-means paper [4])

It first partitions the normalized data into k clusters, where k lies between 1 and n and n is the total number of instances. The next step is to check whether there are any empty clusters. If there are, new clusters are created to replace these empty clusters, and instances are then re-assigned to the existing centers. This iteration continues until there is no empty cluster. Subsequently, the outliers of clusters are removed to form new clusters, in which the instances are more similar to each other, and overlapping adjacent clusters are merged into a new cluster. In this way, the value of k is determined automatically by splitting or merging clusters. The last step is to label the clusters according to their populations; that is, if the population ratio of a cluster is above a given threshold, all the instances in that cluster are classified as normal; otherwise, they are labeled abnormal.
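A rough Python sketch of this outer loop follows. The bounded re-seeding, the outlier rule (distance greater than twice the mean distance), and the omission of the cluster-merging step are our simplifying assumptions, not the exact criteria of the Y-means paper [4].

import math
import random

def nearest(centers, p):
    return min(range(len(centers)), key=lambda i: math.dist(centers[i], p))

def assign(points, centers):
    groups = [[] for _ in centers]
    for p in points:
        groups[nearest(centers, p)].append(p)
    return groups

def y_means(points, k=5, pop_threshold=0.05):
    centers = random.sample(points, k)
    for _ in range(20):                        # re-seed empty clusters (bounded)
        groups = assign(points, centers)
        empty = [i for i, g in enumerate(groups) if not g]
        if not empty:
            break
        for i in empty:                        # replace each empty cluster
            centers[i] = random.choice(points)
    groups = assign(points, centers)
    # split: an outlier far from its center seeds a new cluster
    for c, g in zip(list(centers), groups):
        if not g:
            continue
        mean_d = sum(math.dist(c, p) for p in g) / len(g)
        far = [p for p in g if math.dist(c, p) > 2 * mean_d]
        if far:
            centers.append(far[0])
    # label: clusters with a small population ratio are abnormal
    groups = assign(points, centers)
    labels = ["normal" if len(g) / len(points) > pop_threshold else "abnormal"
              for g in groups]
    return centers, labels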

Positive Points of this algorithm:

The Y-means algorithm does not depend on a fixed number of clusters.

It removes empty clusters from the model and rebuilds them.

Negative Points of this algorithm:

It uses distance-based measures for all calculations, which are known to be less accurate when clusters with different densities and sizes exist.

Handling degeneracy takes much time.


2.4 BIRCH - Balanced Iterative Reducing and Clustering using Hierarchies

The BIRCH[5] algorithm takes on the order of O(N) time to form the clusters from the input data set. It is divided into several phases and takes the following input parameters:

T - the threshold, i.e. the maximum size of a cluster.

B - the branching factor of the tree.

P - the memory size available to the process.

L - the maximum number of clusters at each leaf node.

It maintains a balanced tree structure (similar to a B+-tree) with each node having a maximum of B children. All the clusters are at the leaf nodes of the tree. Initially the tree is empty and, say, T = 0. As new data points keep coming, the algorithm traverses the tree to find the appropriate leaf node where the point can fit, and then it looks for the best match among the clusters in that leaf node. If the point fits into one of the clusters it is inserted there; otherwise a new cluster is formed. The fit of a data point is defined by a distance-based measure (which can be Manhattan, Euclidean, etc.), and the cluster statistics are updated after insertion. If the formation of a cluster increases the leaf's cluster count beyond L, the leaf is split into two leaves with a parent above them, and the clusters are assigned to the appropriate leaf nodes. Also, if at some point the memory cap P is reached, T is increased so that the cluster sizes grow and more points can be fitted into each of the clusters, thereby reducing the cluster count and freeing up some memory.
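The leaf-level part of this insertion can be sketched in Python as follows, reusing the CF class from the sketch in section 2.2; the split policy (sorting the clusters by centroid and halving the list) is our simplification of BIRCH's farthest-pair split [5].

# Insert a point into a leaf (a list of CF clusters) under threshold T and
# leaf capacity L; returns (leaf, overflow_leaf or None).
def insert(leaf, point, T, L):
    if leaf:
        cf = min(leaf, key=lambda c: c.distance(point))
        trial = CF(point)
        trial.merge(cf)              # what the cluster would look like
        if trial.radius() <= T:      # threshold requirement: radius stays <= T
            cf.absorb(point)
            return leaf, None
    leaf.append(CF(point))           # no fit: start a new cluster
    if len(leaf) > L:                # leaf overflow: split into two leaves
        leaf.sort(key=lambda c: c.centroid())
        half = len(leaf) // 2
        return leaf[:half], leaf[half:]
    return leaf, None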

Positive Points of this algorithm:

Time complexity is O(N).

It is memory efficient.

Classification of a new data point is easy.

Negative Points of this algorithm:

The algorithm is order dependent, so the clustering is not unique, i.e. the clustering results depend upon the order of the data points.

It uses distance-based measures for all calculations, which are known to be less accurate when clusters with different densities and sizes exist.

Some data points may be assigned to the wrong clusters because of the limitations of distance-based measurement.

It takes input parameters.

The clusters formed are spherical, which may lead to many false positives. We describe this problem further in section 3.


2.5 DBSCAN - Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise

DBSCAN[6] is an O(N log N) time clustering algorithm. It is based on density as well as distance, instead of distance alone. It iterates once over all the data points, and the additional log(N) time needed in each step makes it an O(N log N) algorithm. It takes two input parameters:

Eps - the distance within which one should look for a point's neighbors.

Minpts - the number of points which must lie within the Eps-neighborhood of a point for it to be a core point.

For each point, first find the Eps-neighborhood of that point (this step takes O(log N) time using efficient R*-trees); then, if more than Minpts data points lie within this region, it is declared a cluster of its own and assigned a new cluster ID. One might think that every point which has more than Minpts data points within its neighborhood would become a separate cluster, but this is not true, since the paper also defines a merging mechanism for clusters which should form one cluster, namely density-reachability and density-connectedness for a pair of clusters. In the step where the neighborhood is found and the cluster ID is assigned, a further loop is run over each point in the neighborhood of the current point, checking whether those points also form their own clusters; if they do, the clusters are merged based on the definition of reachability. This step proceeds recursively (each of the merged cluster's points checks its own neighborhood and hence its own cluster possibilities), and, using the definition of connectedness, clusters keep being merged until no more clusters can be merged.
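A textbook sketch of this procedure in Python is given below; it uses a naive O(N^2) neighborhood search instead of the R*-tree that gives the O(N log N) bound discussed above.

import math

def dbscan(points, eps, minpts):
    UNSEEN, NOISE = -2, -1
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] != UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < minpts:
            labels[i] = NOISE                 # not a core point (for now)
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:                          # expand via density-reachability
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id        # border point joins the cluster
            if labels[j] != UNSEEN:
                continue
            labels[j] = cluster_id
            js = neighbors(j)
            if len(js) >= minpts:             # j is a core point: keep expanding
                queue.extend(js)
        cluster_id += 1
    return labels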

Positive Points of this algorithm:

It is a density-based clustering algorithm, which is more accurate than distance-based algorithms.

The algorithm produces unique clustering results.

It is able to detect clusters of any size and shape.

Negative Points of this algorithm:

It also takes input parameters.

It is not capable of differentiating clusters with different sizes and densities, since Eps is predefined and fixed all the time.


    3 Experiment On ADWICE

ADWICE[2] uses BIRCH as the clustering algorithm for learning the normal data and then classifying new data as anomalous or normal. BIRCH suffers from a number of shortcomings. Here we try to reduce the number of false positives by modifying the threshold calculation, the cluster bounds, and the cluster structure, through changes to the cluster feature (CF). The original BIRCH algorithm uses a constant threshold (T), the same for every cluster, which is increased whenever the number of clusters reaches the maximum, at which point the nearest clusters are merged according to T.

Fixing the same threshold for all clusters is unfair for many of them. For example, consider a cluster with all points near its center and with threshold T. This cluster can include some bad points which are near the boundary. Hence, fixing the same threshold for all clusters is not fine; rather, the threshold should depend on cluster properties like the distribution of points, the density of the cluster, etc.

BIRCH uses distance-based measures in its clustering algorithm, according to which all clusters have the same threshold size T. For a new point to be included in a cluster, its distance from the center of the cluster has to be less than T. We therefore define the inclusion region as the spherical region of radius T around the center of the cluster. Currently, the inclusion region is independent of the current density of the cluster and is the same for all clusters. But if a cluster is dense, the inclusion region should be smaller and should depend on the current radius of the cluster rather than on some predefined fixed threshold, while for a sparse cluster the inclusion region should be relatively large.

So, the inclusion of a new point in a cluster should depend on the density of the cluster (i.e. the number of points in the cluster and its current radius). Mathematically, the measurements are made on the basis of two more variables, t' and R', both of which are explained below[7].

R' (an additional statistical variable stored with each cluster feature) is different for each cluster and depends on the current number of points in the cluster and on its current radius R(CFi):

R'(CFi) = R(CFi) (1 + c/fn(n, d))    (2)

where
d = the dimension of the data points,
n = the number of points inside the cluster,
fn(n, d) = some function of n and d,
c = some constant.

That is, R' is the current radius plus the current radius multiplied by some constant and divided by some function of n.

The function fn can be log_d(n) or just log(n). So the threshold requirement, stated on the radius R(CFi) of the cluster after it absorbs the new point, should be

R(CFi) <= R'(CFi)    (3)

But using the above expression alone as the measure, clustering will suffer when a cluster contains only one or very few points; hence we define t' as a threshold for handling these base cases (it can be kept fairly small). So the threshold requirement becomes

R(CFi) <= max(R'(CFi), t')    (4)


Also, for large sparse clusters, we want an upper bound on the radius of the cluster, so as to prevent some clusters from exploding. So the threshold requirement in ADWICE-TRAD[7] would be

R(CFi) <= min(max(R'(CFi), t'), T)    (5)
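A small Python sketch of this adaptive threshold requirement follows; the choice fn(n, d) = log_d(n) and the default values of c, t' and T are illustrative assumptions only.

import math

# Sketch of the ADWICE-TRAD threshold requirement (5). The function fn and
# the constants c, t_base (t') and T are illustrative assumptions.
def adaptive_threshold(radius, n, d, c=1.0, t_base=0.05, T=2.0):
    fn = math.log(n + 1, d) if d > 1 else math.log(n + 1)  # fn(n, d) = log_d(n)
    r_prime = radius * (1 + c / max(fn, 1e-9))             # equation (2)
    return min(max(r_prime, t_base), T)                    # equation (5)

def may_absorb(radius_after, radius_before, n, d):
    # the cluster may absorb the point only if its grown radius stays
    # within the adaptive bound computed from its state before absorption
    return radius_after <= adaptive_threshold(radius_before, n, d)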

Figure 2: Actual vs Detected Anomaly

FP Rate = FP/(FP + TN)
Detection Rate = TP/(TP + FN)
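In code these two evaluation metrics are simply (trivial helpers; the names are ours):

# The evaluation metrics used above, from the confusion-matrix counts.
def fp_rate(fp, tn):
    return fp / (fp + tn)        # false positive rate

def detection_rate(tp, fn):
    return tp / (tp + fn)        # recall over attack instances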

We have made another change in ADWICE-GRID. The ADWICE-GRID algorithm makes cluster patterns like a circle in two dimensions, a sphere in three dimensions, and so on. We have changed this to patterns like a rectangle in two dimensions, a box in three dimensions, and so on. When cluster patterns are in the form of a circle or a sphere they tend to include some anomalous region, and we can remove part of this region by making the cluster in the form of a box. The following is an example in two dimensions.

In this example (figure 3) all the points are near each other except a few. When we try to form a cluster, some anomalous region is also covered by the cluster, because the center of the cluster lies nearer to the points which are close to each other while some points are far away from this center, and the cluster includes all the points on the basis of its radius. So some anomalous region is also covered by this cluster. We can remove part of this anomalous region by making the cluster in the form of a box; hence the new name of this algorithm is ADWICE-BOX.

We also need to modify the cluster feature (CF). Previously the CF stored three values: the number of points, the linear sum of the points, and the square sum of all the points. We modify the CF to store two further parameters: the min and max values in each dimension. When a new point arrives, its value in each dimension is compared with the stored min and max and these are updated; the same is done when two clusters merge. The following are the brief steps of the new ADWICE-BOX algorithm (a code sketch follows the steps).

Calculate the new center of the cluster: center of cluster = (min + max)/2.


    Figure 3: Box Cluster

Calculate a radius vector instead of a single radius for the cluster: radius of cluster = (max - min)/2.

When a new point arrives or two clusters merge, find the closest cluster on the basis of the radius vector instead of a single cluster radius.
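The sketch below shows the modified, box-shaped cluster feature in Python: per-dimension min and max values replace the single scalar radius. The class and method names are ours, not from the ADWICE code.

# Box-shaped cluster feature: per-dimension min/max instead of one radius.
class BoxCF:
    def __init__(self, point):
        self.lo = list(point)
        self.hi = list(point)

    def center(self):
        return [(l + h) / 2 for l, h in zip(self.lo, self.hi)]   # (min + max)/2

    def radius_vector(self):
        return [(h - l) / 2 for l, h in zip(self.lo, self.hi)]   # (max - min)/2

    def absorb(self, point):
        self.lo = [min(l, x) for l, x in zip(self.lo, point)]
        self.hi = [max(h, x) for h, x in zip(self.hi, point)]

    def merge(self, other):
        self.lo = [min(a, b) for a, b in zip(self.lo, other.lo)]
        self.hi = [max(a, b) for a, b in zip(self.hi, other.hi)]

    def contains(self, point, slack=0.0):
        # a point lies inside the box if every coordinate falls within the
        # per-dimension range (optionally widened by a slack margin)
        return all(l - slack <= x <= h + slack
                   for l, h, x in zip(self.lo, self.hi, point))

The contains() test replaces the spherical radius test, so the region covered by a cluster hugs the data more tightly in each dimension.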

    3.1 Results

In this experiment we evaluated the detection quality of ADWICE-BOX with the grid index against ADWICE-grid[2], using KDDCUP99. Our algorithm needs only a normal data set for training. The KDD training data set of 972781 session records was used to train our model. This model was evaluated on the KDD testing data set, consisting of 311029 session records of normal data and many instances of different attack types. We performed this experiment with different numbers of clusters: 5K, 8K, 12K, 15K, 18K, 20K and 30K. Our results are found to be better than those of ADWICE with grid, but only by a small margin. Figure 4 gives the comparison between the two algorithms with the maximum number of clusters set to 12K.

Refer to appendix A for the modified methods of the ADWICE code.

    4 Future Work

We have modified ADWICE with grid into ADWICE-BOX with grid, and we will concentrate on the following points.

Proper parameter settings (maximum number of clusters, threshold, etc.) are important for ADWICE-BOX efficiency. We will concentrate on more reasonable ways of increasing the threshold dynamically and make corresponding changes in the algorithm so that it becomes independent of the maximum number of clusters.


We will also concentrate on learning from the input data set in such a way that the result does not depend on the order of the input data.

We will reduce the running time complexity and the number of false positives as much as possible.

    Figure 4: Results of ADWICE-BOX

    References

[1] Kalle Burbeck and Simin Nadjm-Tehrani. ADWICE: Anomaly detection with real-time incremental clustering. In Choonsik Park and Seongtaek Chee, editors, Lecture Notes in Computer Science, pages 407-424. Springer Berlin / Heidelberg, January 2005.

[2] Kalle Burbeck and Simin Nadjm-Tehrani. Adaptive real-time anomaly detection with improved index and ability to forget. In ICDCSW '05: Proceedings of the Second International Workshop on Security in Distributed Computing Systems (SDCS), pages 195-202, Washington, DC, USA, 2005. IEEE Computer Society.

[3] Matthew V. Mahoney. Network traffic anomaly detection based on packet bytes. In SAC '03: Proceedings of the 2003 ACM Symposium on Applied Computing, pages 346-350, New York, NY, USA, 2003. ACM Press.


[4] Yu Guan, Nabil Belacel, and Ali A. Ghorbani. Y-means: a clustering method for intrusion detection. In Proceedings of the Canadian Conference on Electrical and Computer Engineering (CCECE 2003), Canada, May 2003.

[5] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 103-114, New York, NY, USA, 1996. ACM Press.

[6] Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Evangelos Simoudis, Jiawei Han, and Usama Fayyad, editors, Second International Conference on Knowledge Discovery and Data Mining, pages 226-231, Portland, Oregon, 1996. AAAI Press.

[7] Nakul Aggrawal. Improving the efficiency of network intrusion detection systems. B.Tech dissertation, India, June 2006.

[8] Simin Nadjm-Tehrani. Source code of ADWICE. http://www.ida.liu.se/~snt/research/adwice/, 2005.
