Anomaly Detection Systems

Anomaly Detection Systems. 2/87 Contents Statistical methods Systems with learning Clustering in anomaly detection systems


Contents

• Statistical methods

• Systems with learning

• Clustering in anomaly detection systems


Anomaly detection

• Anomaly detection involves a process of establishing profiles of normal behaviour, comparing actual user/network behaviour to those profiles, and flagging deviations from the normal.

• The basis of anomaly detection is the assertion that abnormal behaviour patterns indicate misuse of systems.


Anomaly detection

• Profiles are defined as sets of metrics. Metrics are measures of particular aspects of user behaviour.

• Each metric is associated with a threshold or range of values.


Anomaly detection

• Anomaly detection depends on an assumption that users exhibit predictable, consistent patterns of system usage.

• The approach also accommodates adaptations to changes in user behaviour over time.


Anomaly detection

• The completeness of anomaly detection has yet to be verified (no one knows whether any given set of metrics is rich enough to express all anomalous behaviour).


Statistical methods

• Parametric methods

– Analytical approaches in which assumptions are made about the underlying distribution of the data being analyzed.

– The usual assumption is that the distributions of usage patterns are Gaussian:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - x_0)^2}{2\sigma^2}}$$

x0 – mean

σ – standard deviation


Statistical methods

• Non-parametric methods

– Involve nonparametric data classification techniques, specifically cluster analysis.


Statistical methods

• Denning's model (the IDES model for intrusion detection)

– Four statistical models may be included in the system:

• Operational model

• Mean and standard deviation model

• Multivariate model

• Markov process model

– Each model is suitable for a particular type of system metric.


Statistical methods

• Operational model

– This model applies to metrics such as event counters for the number of password failures in a particular time interval.

– The model compares the metric to a set threshold, triggering an anomaly when the metric exceeds the threshold value.
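A minimal sketch of the operational model, assuming a sliding time window over event timestamps. The class name `FailureCounter` and its parameters are hypothetical and chosen only for illustration:

```python
from collections import deque


class FailureCounter:
    """Counts events in a sliding time window and flags threshold breaches.

    Hypothetical helper: the threshold and window length are assumptions.
    """

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def record(self, timestamp):
        """Register one event; return True if the count exceeds the threshold."""
        self.timestamps.append(timestamp)
        # Drop events that have fallen out of the time window.
        while self.timestamps[0] <= timestamp - self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) > self.threshold
```

For example, with a threshold of 3 failures per 60 seconds, the fourth password failure inside one minute would trigger an anomaly.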


Statistical methods

• Mean and standard deviation model

– A classical mean and standard deviation characterization of data.

– The assumption is that all the analyzer knows about a system behaviour metric is its mean and standard deviation.


Statistical methods

• Mean and standard deviation model (cont.)

– A new behaviour observation is defined to be abnormal if it falls outside a confidence interval.

– This confidence interval extends d standard deviations on either side of the mean for some parameter d (usually d=3).


Statistical methods

• Mean and standard deviation model (cont.)

– This characterization is applicable to event counters, interval timers, and resource measures (memory, CPU, etc.)

– It is possible to assign weights to these computations, such that more recent data are assigned greater weights.
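The mean and standard deviation test can be sketched as follows. The exponential weighting scheme (`decay ** age`) is an assumption made for illustration; the slides only state that more recent data may receive greater weights:

```python
import math


def weighted_mean_std(values, decay=1.0):
    """Mean and standard deviation with weights decay**age (age 0 = newest).

    decay=1.0 gives the ordinary (population) mean and standard deviation.
    """
    n = len(values)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total
    return mean, math.sqrt(var)


def is_abnormal(history, observation, d=3.0, decay=1.0):
    """Flag an observation lying more than d standard deviations from the mean."""
    mean, std = weighted_mean_std(history, decay)
    return abs(observation - mean) > d * std
```

With `decay` below 1, older observations contribute less, so the profile follows gradual changes in behaviour.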


Statistical methods

• Multivariate model

– This is an extension to the mean and standard deviation model.

– It is based on performing correlations among two or more metrics.

– Instead of basing the detection of an anomaly strictly on one measure, one might base it on the correlation of that measure with another measure.


Statistical methods

• Multivariate model (cont.)

– Example:

• Instead of detecting an anomaly based solely on the observed length of a session, one might base it on the correlation of the length of the session with the number of CPU cycles utilized.
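One possible way to realize this idea is the squared Mahalanobis distance, which scores an observation against the joint behaviour of two metrics. This is a swapped-in technique chosen for illustration, not necessarily what IDES used; the function name and the sample data are hypothetical:

```python
def mahalanobis_sq_2d(history, point):
    """Squared Mahalanobis distance of `point` from 2-D `history` samples."""
    n = len(history)
    mx = sum(x for x, _ in history) / n
    my = sum(y for _, y in history) / n
    # Sample variances and covariance of the two metrics.
    sxx = sum((x - mx) ** 2 for x, _ in history) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in history) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in history) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mx, point[1] - my
    # d^T * inverse(covariance) * d, written out for the 2x2 case.
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


# Hypothetical (session length, CPU cycles) pairs.
sessions = [(10, 100), (20, 210), (30, 290), (40, 405), (50, 500)]
on_trend = mahalanobis_sq_2d(sessions, (35, 350))
off_trend = mahalanobis_sq_2d(sessions, (35, 100))
```

Here a session of length 35 is unremarkable on its own, but combined with an unusually low CPU figure it scores far higher than the on-trend point.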


Statistical methods

• Markov process model

– Under this model, the detector considers each different type of audit event as a state variable and uses a state transition matrix to characterize the transition frequencies between states (not the frequencies of the individual states/audit records).


Statistical methods

• Markov process model (cont.)

– A new observation is defined as anomalous if its probability, as determined by the previous state and value in the state transition matrix, is too low/high.

– This allows the detector to spot unusual command or event sequences, not just single events.

– This introduces the notion of performing stateful analysis of event streams (frequent episodes, etc.)
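A minimal sketch of the Markov process model, assuming audit events are given as a sequence of type names. The function names and the `min_prob` cut-off are hypothetical, and only the "too low" case is checked here:

```python
from collections import Counter, defaultdict


def fit_transitions(events):
    """Estimate P(next | previous) from a sequence of audit event types."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(events, events[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for prev, row in counts.items()
    }


def is_anomalous(matrix, prev, nxt, min_prob=0.05):
    """Flag a transition whose estimated probability is below `min_prob`."""
    return matrix.get(prev, {}).get(nxt, 0.0) < min_prob


training = ["login", "read", "write", "logout"] * 10
matrix = fit_transitions(training)
```

A transition never seen in training, such as `login` followed directly by `logout`, gets probability 0 and is flagged.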


Statistical methods

• Markov process model (cont.) – Example - NIDES (Next-generation Intrusion Detection Expert System)

• Developed by SRI (Stanford Research Institute) in the 1990s.

• Measures various activity levels.

• Combines these into a single “normality” measure.

• Checks this against a threshold.

• If the measure is above the threshold, the activity is considered abnormal.


Statistical methods

• Markov process model (cont.) – Example - NIDES (cont.)

• NIDES measures

– Intensity measures

» An example would be the number of audit records (log entries) generated within a set time interval.

» Several different time intervals are used in order to track short-, medium-, and long-term behaviour.


Statistical methods

• Markov process model (cont.) – Example - NIDES (cont.)

• NIDES measures (cont.)

– Distribution measures

» The overall distribution of the various audit records (log file entries) is tracked via histograms.

» A difference measure is defined to determine how close a given short-term histogram is to “normal” behaviour.


Statistical methods

• Markov process model (cont.) – Example - NIDES (cont.)

• NIDES measures (cont.)

– Categorical data

» The names of files accessed or the names of remote computers accessed are examples of categorical data used.


Statistical methods

• Markov process model (cont.) – Example - NIDES (cont.)

• NIDES measures (cont.)

– Counting measures

» These are numerical values that measure parameters such as the number of seconds of CPU time used.

» They are generally taken over a fixed amount of time or over a specific event, such as a single login.

» Thus, they are similar in character to intensity measures, although they measure a different kind of activity.


Statistical methods

• Markov process model (cont.) – Example - NIDES (cont.)

• The different measurements each define a statistic Sj.

• These measurements are assumed (constructed to be) appropriate (this includes normalization), and are combined to produce a χ²-like statistic:

$$T^2 = \frac{1}{n} \sum_{j=1}^{n} S_j^2$$


Statistical methods

• Markov process model (cont.) – Example - NIDES (cont.)

• A more complicated measure would include the correlation between the events (as was done with IDES):

$$IS = (S_1, \dots, S_n)\, C^{-1}\, (S_1, \dots, S_n)^T$$

• Here, C is the correlation matrix between Si and Sj for all i and j. IS is called the IDES score.
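The two combined statistics can be sketched in plain Python: the T² statistic as the mean of the squared Sj, and the IDES score as the quadratic form of the Sj vector with the inverse correlation matrix. The function names are hypothetical, and the small linear solver is included only to keep the example self-contained:

```python
def t2_statistic(s):
    """T^2 = (1/n) * sum of S_j^2."""
    return sum(x * x for x in s) / len(s)


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ides_score(s, corr):
    """IS = s C^{-1} s^T, computed as s . (C^{-1} s) without forming C^{-1}."""
    y = solve(corr, s)
    return sum(si * yi for si, yi in zip(s, y))
```

When the measures are uncorrelated (C is the identity matrix), the IDES score reduces to the plain sum of squares, that is, n times T².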


Statistical methods

• Markov process model (cont.) – Example - NIDES (cont.)

• NIDES compares recent activity with past activity, using a methodology that amounts to a sliding window on the past.

• Thus it is designed to detect changes in activity and to adapt to new activity levels.


Statistical methods

• Markov process model (cont.) – Example - NIDES (cont.)

• NIDES intensity measures are counts of audit records per time unit etc.

• This provides an overall activity level for the system.

• These are updated continuously rather than recomputed at each time interval.


Statistical methods

• Markov process model (cont.) – Example - NIDES (cont.)

• Possible elements that can be monitored with this basic idea:

– Average system load.

– Number of active processes.

– Number of E-mails received.

– Different types of audit records (can be tracked separately).


Statistical methods

• Markov process model (cont.) – Example - NIDES (cont.)

• The obvious extension of the intensity measures idea is to track the different types of audit records.

• This leads to a distribution (histogram) for the audit records.

• Similarly, one could track the sizes of E-mail messages received, or the types of files accessed.

• These can be updated continuously.

• Distributions are then compared by means of a squared error metric.
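A minimal sketch of a continuously updated histogram and a squared error comparison. The exponential ageing factor `decay` is an assumption used to illustrate "continuous" updating; the function names are hypothetical:

```python
def update_histogram(hist, category, decay=0.9):
    """Age all bins by `decay`, then add one count to `category` (in place)."""
    for key in hist:
        hist[key] *= decay
    hist[category] = hist.get(category, 0.0) + 1.0


def normalize(hist):
    """Convert raw bin weights to relative frequencies."""
    total = sum(hist.values())
    return {k: v / total for k, v in hist.items()} if total else {}


def squared_error(p, q):
    """Sum of squared differences between two relative-frequency histograms."""
    keys = set(p) | set(q)
    return sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys)
```

A short-term histogram built with `update_histogram` can be normalized and compared against a long-term profile; a large squared error suggests that the mix of audit record types has shifted.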


Statistical methods

• Markov process model (cont.) – Example - NIDES (cont.)

• Categorical measures can be for example the names of files accessed.

• They are treated just like distributional measures.

• Now each bin corresponds to a single category, while with distributional measures a bin can correspond to a range of values.

• The updates are still done continuously.


Statistical methods

• Markov process model (cont.) – Example - NIDES (cont.)

• All the measures are combined into the T² statistic.

• The value is compared with a threshold to determine if the activity is “abnormal”.

• The threshold is usually set empirically, based on the observed network behaviour in some period of time.


Statistical methods

• Markov process model (cont.) – Example - NIDES (cont.)

• NIDES produces a single, overall measure of “normality”, which could allow further investigation into the components that make up the statistic upon an alert.

• The problem with this is that an unusually low value for one statistic can mask a high one for another – multifaceted measures are more useful.


Statistical methods

• Advantages of parametric approach

– Statistical anomaly detection could reveal interesting, sometimes suspicious, activities that could lead to discoveries of security breaches.


Statistical methods

• Advantages of parametric approach (cont.)

– Parametric statistical systems do not require the constant updates and maintenance that misuse detection systems do.

– However, metrics must be well chosen, adequate for good discrimination, and well-adapted to changes in behaviour (that is, changes in user behaviour must produce a consistent, noticeable change in the corresponding metrics).


Statistical methods

• Disadvantages of parametric approach

– Batch mode processing of audit records, which eliminates the capability to perform automated responses to block damage.

– Although more recent systems attempt to perform real-time analysis of audit data, the memory and processing loads involved in using and maintaining the user profile knowledge base usually cause the system to lag behind audit record generation.


Statistical methods

• Disadvantages of parametric approach (cont.)

– The nature of statistical analysis reduces the capability of taking into account the sequential relationships between events.

– The exact order of the occurrence of events is not provided as an attribute in most of these systems.


Statistical methods

• Disadvantages of parametric approach (cont.)

– Because many anomalies indicating attack depend on such sequential event relationships, this situation represents a serious limitation to the approach.

– In cases when quantitative methods (Denning's operational model) are utilized, it is also difficult to select appropriate values for thresholds and ranges.


Statistical methods

• Disadvantages of parametric approach (cont.)

– The false positive rates associated with statistical analysis systems are high, which sometimes leads to users ignoring or disabling the systems.

– The false negative rates are also difficult to reduce in these systems.


Statistical methods

• Nonparametric measures

– One of the problems of parametric methods is that error rates are high when the assumptions about the distribution are incorrect.


Statistical methods

• Nonparametric measures (cont.)

– When researchers began collecting information about system usage patterns that included attributes such as system resource usage, the distributions were discovered not to be normal.

– Consequently, building the normal-distribution assumption into the measures led to high error rates.


Statistical methods

• Nonparametric measures (cont.)

– A way of overcoming these problems is to utilize nonparametric techniques for performing anomaly detection.

– This approach provides the capability of accommodating users with less predictable usage patterns and allows the analyzer to take into account system measures that are not easily accommodated by parametric schemes.


Statistical methods

• Nonparametric measures (cont.)

– The nonparametric approach involves nonparametric data classification techniques, specifically cluster analysis.

– In cluster analysis, large quantities of historical data are collected (a sample set) and organized into clusters according to some evaluation criteria.


Statistical methods

• Nonparametric measures (cont.)

– Preprocessing is performed in which features associated with a particular event stream (often mapped to a specific user) are converted into a vector representation (for example, Xi = [f1, f2, ..., fn] in an n-dimensional state).


Statistical methods

• Nonparametric measures (cont.)

– A clustering algorithm is used to group vectors into classes by behaviours, attempting to group them so that members of each class are as close as possible to each other while different classes are as far apart as they can be.


Statistical methods

• Nonparametric measures (cont.)

– In nonparametric statistical anomaly detection, the premise is that a user's activity data, as expressed in terms of the features, falls into two distinct clusters: one indicating anomalous activity and the other indicating normal activity.


Statistical methods

• Nonparametric measures (cont.)

– Various clustering algorithms are available. These range from algorithms that use simple distance measures to determine whether an object falls into a cluster, to more complex concept-based measures (in which an object is "scored" according to a set of conditions and that score is used to determine membership in a particular cluster).

– Different clustering algorithms usually best serve different data sets and analysis goals.


Statistical methods

• Nonparametric measures (cont.)

– The advantages of nonparametric approaches include the capability of performing reliable reduction of event data (in the transformation of raw event data to vectors).

– The reduction may reach two orders of magnitude compared to the classical approach that does not use vectors.


Statistical methods

• Nonparametric measures (cont.)

– Other benefits are improvement in the speed of detection and improvement in accuracy over parametric statistical analysis.

– Disadvantages involve concerns that expanding features beyond resource usage would reduce the efficiency and the accuracy of the analysis.


Systems with learning

• Two phases of system operation:

– The learning phase, in which the system is taught what normal behaviour is.

– The recognition phase, in which the system classifies the input vectors according to the knowledge acquired in the learning process.

– These systems also include a conversion of raw data into feature vectors.


Systems with learning

• Example: Neural networks

– Neural networks use adaptive learning techniques to characterize anomalous behaviour.

– This analysis technique operates on historical sets of training data, which are presumably cleansed of any data indicating intrusions or other undesirable user behaviour.


Systems with learning

• Example: Neural networks (cont.)

– Neural networks consist of numerous simple processing elements called neurons that interact by using weighted connections.

– The knowledge of a neural network is encoded in the structure of the net in terms of connections between units and their weights.

– The actual learning process takes place by changing weights and adding or removing connections.


Systems with learning

• Example: Neural networks (cont.)

– Neural network processing involves two stages.

• In the first stage, the network is populated by a training set of historical or other sample data that is representative of user behaviour.

• In the second stage, the network accepts event data and compares it to historical behaviour references, determining similarities and differences.


Systems with learning

• Example: Neural networks (cont.)

– The network indicates that an event is abnormal by changing the state of the units, changing the weights of connections, adding connections, or removing them.

– The network also modifies its definition of what constitutes a normal event by performing stepwise corrections.


Systems with learning

• Example: Neural networks (cont.)

– Neural networks make no prior assumptions about the expected statistical distribution of metrics, so this method retains some of the advantages that nonparametric statistical techniques have over classical statistical analysis.


Systems with learning

• Example: Neural networks (cont.)

– Among the problems associated with utilizing neural networks for intrusion detection is a tendency to form mysterious unstable configurations in which the network fails to learn certain things for no apparent reason.


Systems with learning

• Example: Neural networks (cont.)

– The major drawback to utilizing neural networks for intrusion detection is that neural networks don't provide any explanation of the anomalies they find.


Systems with learning

• Example: Neural networks (cont.)

– The lack of explanation impedes the ability of users to establish accountability or otherwise address the roots of the security problems that allowed the detected intrusion.

– This makes neural networks poorly suited to the needs of security managers.


Systems with learning

• General problems related to all systems with learning

– The problem with all learning-based approaches is that the effectiveness of the approach depends on the quality of the training data.

– In learning-based systems, the training data must reflect normal activity for the users of the system.


Systems with learning

• General problems related to all systems with learning (cont.)

– The training data may not be comprehensive enough to reflect all possible normal user behaviour patterns.

– This weakness produces a high false positive rate.

– The rate is high because an event that does not completely match the learnt knowledge often, though not always, generates a false alarm.


Clustering in anomaly detection

• Clustering definition:

– “Cluster analysis is the art of finding groups in data”

– The aim: group the given objects in such a way that the objects within a group are mutually similar and at the same time dissimilar from other groups.


Clustering in anomaly detection

• Formal definition:

– Let P be a set of vectors, whose cardinality is m, and whose elements are p1,…,pm, of dimensions n1,…,nm, respectively.

– The task: partition the set P, optimizing a partition criterion, into k subsets P1,…,Pk, such that the following holds:

$$P_1 \cup P_2 \cup \dots \cup P_k = P$$

$$P_i \cap P_j = \emptyset, \quad i, j \in \{1, 2, \dots, k\},\ i \neq j$$


Clustering in anomaly detection

[Figure: block diagram with the components Incoming traffic/logs, Data pre-processor, Activity data, Detection model(s), Detection algorithm, Alerts, Decision criteria, Alert filter, and Action/Report; two components are annotated “Clustering!”.]


Clustering in anomaly detection

• Why should we do clustering instead of supervised learning?

– Labelling a large set of samples is often costly.

– Very large data sets – train the system with a large amount of unlabelled data and then label with supervision.

– Track slow changes of patterns in time without supervision – improves performance.

– Smart feature extraction.

– Initial exploratory data analysis.


Clustering in anomaly detection

• Appropriate cluster analysis algorithms:

– Two main classes of clustering algorithms

• Hierarchical

• Non-hierarchical (partitioning)

– Hierarchical

• Less efficient

• More biased results in general

– Non-hierarchical

• Results often depend on the initial partition.


Clustering in anomaly detection

• Appropriate cluster analysis algorithms (cont.)

– A trade-off between correctness and efficiency of the cluster analysis (CA) algorithm must be found in order to achieve the real-time operation of an IDS.

– K-means algorithm – could be a good candidate for implementation in IDS.


Clustering in anomaly detection

• Appropriate cluster analysis algorithms (cont.)

– An outline of the K-means algorithm

1. Initialization: Randomly choose K vectors from the data set and make them initial cluster centres.

2. Assignment: Assign each vector to its closest centre.

3. Updating: Replace each centre with the mean of its members.

4. Iteration: Repeat steps 2 and 3 until there is no more updating.
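The four steps above can be sketched in plain Python. Squared Euclidean distance, the `seed` parameter, and the `max_iter` safety limit are assumptions added for illustration:

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, seed=0, max_iter=100):
    """Return (centres, assignment) for a list of equal-length vectors."""
    rng = random.Random(seed)
    # 1. Initialization: pick K vectors from the data set as initial centres.
    centres = [list(v) for v in rng.sample(vectors, k)]
    assignment = None
    for _ in range(max_iter):
        # 2. Assignment: each vector goes to its closest centre.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(v, centres[c]))
            for v in vectors
        ]
        # 4. Iteration: stop once no assignment changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # 3. Updating: replace each centre with the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return centres, assignment


centres, assignment = kmeans([(0.0,), (1.0,), (10.0,), (11.0,)], k=2)
```

On this small, well-separated data set every choice of initial centres leads to the same two clusters; on real data the result can depend on the initialization.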


Clustering in anomaly detection

• K-means algorithm

– A local optimization algorithm – hill climbing.

– Clustering depends on initial centres, but this can be overcome in several ways.

– Time complexity linear in the number of input vectors.


Clustering in anomaly detection

• Problems to solve

– Determine the number of clusters

– Determine the appropriate distance measure


Clustering in anomaly detection

• Determine the number of clusters

– 2 clusters if we want only to tell “abnormal” from “normal” behaviour.

– More complex clustering evaluation algorithms should be used to detect the number of clusters at which the most compact and separated clusters are obtained.

– Use hierarchical clustering + clustering evaluation algorithms (inefficient).


Clustering in anomaly detection

• Determine the appropriate distance measure

– It must be a metric:

• $\forall a, b:\ d(a,b) \ge 0$

• $\forall a, b:\ d(a,b) = 0 \Leftrightarrow a = b$

• $\forall a, b:\ d(a,b) = d(b,a)$

• $\forall a, b, c:\ d(a,c) \le d(a,b) + d(b,c)$, i.e. the triangle inequality must hold.


Clustering in anomaly detection

• Determine the appropriate distance measure (cont.)

– Typical metrics:

• For equal length input vectors – the Minkowski metric.

• For unequal length input vectors – the edit distance (which is also a metric).


Clustering in anomaly detection

• Minkowski metric

$$d(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q}$$

• q=1, Manhattan (city block) distance

• q=2, Euclidean distance
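The Minkowski metric of order q is the q-th root of the sum of the q-th powers of the absolute coordinate differences. A minimal sketch follows; the function name and the length check are illustrative additions:

```python
def minkowski(x, y, q=2):
    """Minkowski distance of order q between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have equal length")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


manhattan = minkowski((0, 0), (3, 4), q=1)  # city block distance
euclidean = minkowski((0, 0), (3, 4), q=2)  # straight-line distance
```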


• Edit distance
  – Elementary edit operations:
    • Deletions
    • Insertions
    • Substitutions
  – The edit distance is the minimum number of elementary edit operations needed to transform one vector into another.
  – Computed recursively, by filling the matrix of partial edit distances (the edit distance matrix).
  – The definition can include constraints.
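The matrix-filling computation can be sketched as follows. This is the unconstrained, unit-cost variant (each deletion, insertion or substitution costs 1); the function name `edit_distance` and the event names in the example are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, computed by filling the matrix of partial distances."""
    rows, cols = len(a) + 1, len(b) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        D[i][0] = i                      # delete the first i elements of a
    for j in range(cols):
        D[0][j] = j                      # insert the first j elements of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # substitution or match
    return D[-1][-1]

# Two event sequences of unequal length (hypothetical event names).
s1 = ["open", "read", "read", "close"]
s2 = ["open", "read", "write", "read", "close"]
print(edit_distance(s1, s2))
```

Because it works on any pair of sequences, the same function applies to strings and to vectors of different lengths.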

Clustering in anomaly detection


Clustering in anomaly detection

• Labelling clusters
  – A way to determine which cluster contains normal instances and which contain attacks.
  – 1st assumption:
    • Associate the label “normal” with the cluster of the greatest cardinality.
    • Fails with massive attacks, for example the SYN flood attack.
    • Fails with KDD Cup data unless the attacks are filtered out.
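The cardinality assumption and its failure mode fit in a few lines. This is only an illustration: the function name `label_by_cardinality` and the cluster sizes are made up, not taken from the slides.

```python
def label_by_cardinality(clusters):
    """Return the index of the cluster assumed to be 'normal': the largest one."""
    return max(range(len(clusters)), key=lambda i: len(clusters[i]))

# Hypothetical two-cluster outcomes; index 0 holds the truly normal traffic.
quiet_period = [["normal"] * 950, ["attack"] * 50]
flood_period = [["normal"] * 100, ["attack"] * 900]

print(label_by_cardinality(quiet_period))  # picks the normal cluster
print(label_by_cardinality(flood_period))  # picks the attack cluster: wrong label
```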


Clustering in anomaly detection

• To label properly, we need to explore the structure of the clusters.

• Clustering quality criteria are used, combined with some characteristics of the clusters:
  – Silhouette index
  – Davies-Bouldin index
  – Dunn’s index
  – Clusters’ diameters


Clustering in anomaly detection

• Intra-cluster distance
  – The measure of compactness of a cluster (complete diameter, average diameter, centroid diameter).

• Inter-cluster distance
  – The measure of separation between clusters (single linkage, complete linkage, average linkage, centroid linkage).
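The pairwise variants of these measures can be sketched as below, assuming the Euclidean distance between vectors and the usual definitions (complete = largest pairwise distance, single = smallest, average = mean). The function names and sample clusters are illustrative; the diameter functions assume at least two points per cluster.

```python
from itertools import combinations, product
from math import dist  # Euclidean distance, Python 3.8+

def complete_diameter(cluster):
    """Largest distance between two members of the cluster."""
    return max(dist(p, q) for p, q in combinations(cluster, 2))

def average_diameter(cluster):
    """Mean distance over all pairs of distinct members."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def single_linkage(c1, c2):
    """Smallest distance between a member of c1 and a member of c2."""
    return min(dist(p, q) for p, q in product(c1, c2))

def complete_linkage(c1, c2):
    """Largest distance between a member of c1 and a member of c2."""
    return max(dist(p, q) for p, q in product(c1, c2))

def average_linkage(c1, c2):
    """Mean distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(c1, c2)) / (len(c1) * len(c2))

# Hypothetical clusters.
c1 = [(0, 0), (0, 1), (1, 0)]
c2 = [(5, 0), (6, 0)]
print(complete_diameter(c1), average_diameter(c1))
print(single_linkage(c1, c2), average_linkage(c1, c2), complete_linkage(c1, c2))
```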


Example – Davies-Bouldin

• Data set for clustering: $X = \{X_1, \ldots, X_N\}$

• Clustering into L clusters: $C = \{C_1, \ldots, C_L\}$

• Distance between the vectors $X_k$ and $X_l$: $d(X_k, X_l)$


Example – Davies-Bouldin

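• Davies-Bouldin index:

$$DB(C) = \frac{1}{L} \sum_{i=1}^{L} \max_{j \ne i} \left\{ \frac{\Delta(C_i) + \Delta(C_j)}{\delta(C_i, C_j)} \right\}$$

• Inter-cluster distance: $\delta(C_i, C_j)$

• Intra-cluster distance: $\Delta(C_i)$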
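A minimal sketch of the index, assuming its usual form: for each cluster, take the worst (largest) ratio of the summed intra-cluster distances to the inter-cluster distance over all other clusters, then average over the clusters. The function name `davies_bouldin` and the input layout (a list of intra-cluster distances plus a table of inter-cluster distances) are assumptions for illustration.

```python
def davies_bouldin(diameters, linkage):
    """Davies-Bouldin index from precomputed cluster measures.

    diameters[i]   intra-cluster distance of cluster i
    linkage[i][j]  inter-cluster distance between clusters i and j
    """
    L = len(diameters)
    total = 0.0
    for i in range(L):
        total += max((diameters[i] + diameters[j]) / linkage[i][j]
                     for j in range(L) if j != i)
    return total / L

# Two clusters: both terms equal (1 + 3) / 8, so the index is 0.5.
db_two = davies_bouldin([1.0, 3.0], [[0.0, 8.0], [8.0, 0.0]])

# Three clusters with hypothetical measures.
db_three = davies_bouldin([2.0, 2.0, 4.0],
                          [[0.0, 4.0, 8.0],
                           [4.0, 0.0, 8.0],
                           [8.0, 8.0, 0.0]])
print(db_two, db_three)
```

Small values correspond to compact, well separated clusters, since the numerator shrinks with compactness and the denominator grows with separation.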


Example – Davies-Bouldin

• Intra-cluster distance – Centroid diameter

$$\Delta(C_i) = 2 \left( \frac{\sum_{X_k \in C_i} d(X_k, s_{C_i})}{|C_i|} \right), \qquad s_{C_i} = \frac{1}{|C_i|} \sum_{X_k \in C_i} X_k$$
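The centroid diameter is twice the mean distance of the cluster members to the cluster centroid. The sketch below assumes Euclidean distance; the function names are illustrative, and returning 0 for an empty cluster is a convention assumed here rather than something stated on the slide.

```python
from math import dist  # Euclidean distance, Python 3.8+

def centroid(cluster):
    """Component-wise mean of the vectors in the cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def centroid_diameter(cluster):
    """Twice the average distance from the members to the centroid."""
    if not cluster:
        return 0.0  # assumed convention for an empty cluster
    s = centroid(cluster)
    return 2.0 * sum(dist(x, s) for x in cluster) / len(cluster)

spread = [(0.0, 0.0), (2.0, 0.0)]   # centroid (1, 0), each member at distance 1
identical = [(5.0, 5.0)] * 3        # identical vectors, as in a massive attack
print(centroid_diameter(spread), centroid_diameter(identical))
```

A cluster of identical vectors has centroid diameter 0, which is the property the labelling idea later relies on.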


Example – Davies-Bouldin

• Inter-cluster distance – Centroid linkage

$$\delta(C_i, C_j) = d(s_{C_i}, s_{C_j})$$

$$s_{C_i} = \frac{1}{|C_i|} \sum_{X_k \in C_i} X_k, \qquad s_{C_j} = \frac{1}{|C_j|} \sum_{X_k \in C_j} X_k$$
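Centroid linkage is simply the distance between the two cluster centroids. The sketch below assumes Euclidean distance and also works through the two-cluster case, where both terms of the Davies-Bouldin sum are equal and the index reduces to the sum of the two centroid diameters divided by the centroid linkage. All names and sample points are illustrative.

```python
from math import dist  # Euclidean distance, Python 3.8+

def centroid(cluster):
    """Component-wise mean of the vectors in the cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def centroid_diameter(cluster):
    """Twice the average distance from the members to the centroid."""
    s = centroid(cluster)
    return 2.0 * sum(dist(x, s) for x in cluster) / len(cluster)

def centroid_linkage(c1, c2):
    """Distance between the centroids of the two clusters."""
    return dist(centroid(c1), centroid(c2))

c1 = [(0.0, 0.0), (2.0, 0.0)]   # centroid (1, 0), centroid diameter 2
c2 = [(3.0, 4.0), (5.0, 4.0)]   # centroid (4, 4), centroid diameter 2

linkage = centroid_linkage(c1, c2)                                   # sqrt(9 + 16)
db = (centroid_diameter(c1) + centroid_diameter(c2)) / linkage       # (2 + 2) / 5
print(linkage, db)
```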


Clustering in anomaly detection

• A cluster labelling algorithm:
  – Uses a combination of the Davies-Bouldin index of the clustering and the centroid diameters of the clusters.
  – Two clusters: “normal” and “abnormal”.
  – Main idea:
    • Attack vectors are often mutually very similar, if not identical.
    • Consequently, the attack cluster in the case of a massive attack is very compact.


Clustering in anomaly detection

– Main idea (continued):
  • The Davies-Bouldin index of such a clustering is either zero (the non-attack cluster is empty) or very close to zero.
  • The expected value of the centroid diameter of the attack cluster is smaller than that of the non-attack cluster.


Clustering in anomaly detection

– Main idea (continued):
  • A small value of the Davies-Bouldin index indicates the existence of a massive attack.
  • A small value of the centroid diameter indicates the attack cluster.


Clustering in anomaly detection

– Main idea (continued):
  • A higher value of the Davies-Bouldin index indicates that no massive attack is taking place.
  • Then the attack cluster is expected to be less compact than the non-attack cluster, i.e. its centroid diameter is greater than that of the non-attack cluster (because non-massive attack vectors are, in general, very different from one another).
  • In this case, even the cluster cardinality can be used for proper labelling.


Clustering in anomaly detection

• The cluster labelling algorithm:
  – Input:
    • A clustering C of N vectors into 2 clusters, C1 and C2; C1 is the “non-attack” cluster, labelled with “1”.
    • The Davies-Bouldin index threshold, DeltaDB.
    • The centroid diameter difference thresholds, DeltaCD1 and DeltaCD2.
  – Output:
    • The input clustering, relabelled if the relabelling conditions are met.


Clustering in anomaly detection

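• The cluster labelling algorithm (cont.):

db = DaviesBouldinIndex(C);
cd1 = CentroidDiameter(C1);
cd2 = CentroidDiameter(C2);

if (db == 0) && (cd2 == 0)
    Relabel(C);    // DB index is zero: massive attack
else if (db > DeltaDB) && (cd1 > (cd2 + DeltaCD1))
    Relabel(C);    // no massive attack, C1 is the less compact cluster
else if (db < DeltaDB) && ((cd1 + DeltaCD2) < cd2)
    Relabel(C);    // massive attack, C1 is the more compact cluster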


Example – KDD Cup database

• Sample size: N=1000

• Number of clusters: 2

Record No.   DB     CD1        CD2        Intrusion   Good labelling (K-means)   Relabel
0-1000       1.13   32759.24   7108.57    N           N                          2
5000-6000    0.96   4344.63    14158.54   N           Y                          0
7000-8000    0.14   69.6       7096.34    Y-376       N                          3
8000-9000    0      25.19      0          Y-1000      N                          1
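The three relabelling conditions of the labelling algorithm can be replayed on the DB, CD1 and CD2 values in the table. Two things in this sketch are assumptions: the threshold values (the slides do not give them) and the reading of the Relabel column as the number of the condition that fired, with 0 meaning no relabelling.

```python
def relabel_rule(db, cd1, cd2, delta_db, delta_cd1, delta_cd2):
    """Return the number of the relabelling condition that fires (1-3), or 0 for none."""
    if db == 0 and cd2 == 0:
        return 1
    if db > delta_db and cd1 > cd2 + delta_cd1:
        return 2
    if db < delta_db and cd1 + delta_cd2 < cd2:
        return 3
    return 0

# Hypothetical thresholds, chosen for illustration only.
DELTA_DB, DELTA_CD1, DELTA_CD2 = 0.5, 1000.0, 1000.0

# (record range, DB, CD1, CD2) as listed in the table.
rows = [
    ("0-1000",    1.13, 32759.24,  7108.57),
    ("5000-6000", 0.96,  4344.63, 14158.54),
    ("7000-8000", 0.14,    69.6,   7096.34),
    ("8000-9000", 0.0,     25.19,     0.0),
]

fired = [relabel_rule(db, cd1, cd2, DELTA_DB, DELTA_CD1, DELTA_CD2)
         for _, db, cd1, cd2 in rows]
print(fired)
```

With these assumed thresholds the conditions fire in the order 2, 0, 3, 1, which matches the Relabel column; other threshold choices may give different results.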