
[IEEE SOUTHEASTCON 2012 - Orlando, FL, USA (2012.03.15-2012.03.18)] 2012 Proceedings of IEEE Southeastcon - Relevance feature selection with data cleaning for intrusion detection system



Relevance Feature Selection with Data Cleaning for Intrusion Detection System

Shan Suthaharan Department of Computer Science

University of North Carolina at Greensboro Greensboro, USA

[email protected]

Tejaswi Panchagnula Department of Computer Science

University of North Carolina at Greensboro Greensboro, USA

[email protected]

Abstract—Labeled datasets play a major role in validating and evaluating machine learning techniques for intrusion detection systems. To obtain good accuracy in the evaluation, very large datasets should be considered. Intrusion traffic and normal traffic generally depend on a large number of network characteristics called features, but not all of these features contribute to the traffic characteristics. Therefore, eliminating the non-contributing features from the datasets, to improve both the speed and the accuracy of evaluating machine learning techniques, becomes an important requirement. In this paper we suggest an approach that analyzes intrusion datasets, evaluates each feature for its relevance to a specific attack, determines its level of contribution, and eliminates non-contributing features from the dataset automatically. We adopt a Rough Set Theory (RST) based approach and select relevant features automatically using multidimensional scatter plots; a pair-wise feature selection process is adopted to simplify the analysis. In our previous research we used the KDD’99 dataset to validate the RST based approach. KDD’99 contains many redundant data entries, and thus machine learning techniques become biased towards the most frequently occurring events. This property leads the algorithms to ignore less frequent events, which can be more harmful than frequent ones. False positives are another important drawback of the KDD’99 dataset. In this paper, we adopt the NSL-KDD dataset (an improved version of KDD’99) and validate the automated RST based approach. The approach presented in this paper leads to the selection of the most relevant features, and we expect that intrusion detection research using KDD’99-based datasets will benefit from a good understanding of network features and their influence on attacks.

Keywords-intrusion detection; Rough Set Theory; labeled datasets; NSL-KDD dataset; relevance feature selection

I. INTRODUCTION

Modern human society cannot function without an Internet that is highly secure. Yet massive amounts of information are transmitted insecurely over the Internet every instant. Internet users store information at remote locations, share information, carry out remote computations and use many more features provided by Internet technology without knowing the threats to these activities. In recent years unauthorized access to remote information has increased significantly, and it has become a threat to many organizations that handle both secure and unsecure data using emerging technologies such as Cloud computing.

Although attack prevention technologies have been developed to eliminate such unauthorized acts on the Internet, attacks have continued to increase. Therefore it is important for organizations to deploy preventive measures to protect their data and their networks as a whole. Although several preventive measures, such as firewalls and authentication software, are available, strong intrusion attacks are still emerging and taking place actively in current networks. Therefore it is important to have additional measures, such as an intrusion detection system (IDS) that employs sophisticated mathematical and statistical techniques and tools.

An intrusion can take place as an attack on a single host. In this case the IDS should perform intrusion detection on the host's information alone, and such a system is called a host-based IDS. On the other hand, an intrusion can take place as an attack on all or some of the hosts in a network; in this case the IDS should collect information from the entire network to detect intrusions, and such a system is called a network-based IDS. In this paper we deal with the network-based IDS: a collection of software and hardware tools that detects activities which break the security mechanisms related to confidentiality, integrity of data, and user access control. Although an IDS can detect intrusions on its own, its effectiveness can be improved with the help of preventive measures such as firewalls and user authentication protocols [1]. Hence, the IDS should be built as a device or software with mathematical and statistical techniques that collects network traffic information, reads and interprets the information, and detects intrusion attacks automatically. Subsequently it should report the intrusion activities and the threat level to system administrators and network users as appropriate.

We focus on the network-based IDS only, and we are therefore interested in monitoring changes in traffic patterns and detecting intrusions based on anomaly detection models. Anomaly detection models assume that the majority of the network traffic is normal [1-3]. Hence they classify traffic into normal and abnormal based on its behavior; traffic that deviates strongly from normal traffic is detected as an intrusion. Thus they are capable of detecting new attacks. In the network-based IDS, each intrusion attack depends on many features (i.e. network characteristics) because of the distributed nature of network communication. Hence the intrusion data is multidimensional

978-1-4673-1375-9/12/$31.00 ©2012 IEEE


(multiple features), which makes intrusion detection a very complex task. The interesting observation is that not all features necessarily contribute to the intrusion characteristics. Therefore the question is: can we detect intrusions using fewer features? In this paper we address this problem and develop a relevance feature selection mechanism with a data cleaning approach to facilitate the network-based IDS.

One problem with current intrusion detection techniques is that they need significant human interaction to label attack and normal traffic behavior in support of machine learning techniques. However, the biggest challenge is the “Big-Data” problem associated with the large amount of network traffic data collected dynamically in an intrusion detection dataset. In addition to the volume of traffic data the IDS has to deal with, it must tackle the multidimensional data generated by the features that represent network characteristics [4]. It is important to note that, although an intrusion dataset is huge, it has many redundant records and false positives. Cleaning such a dataset to generate a reliable one requires strenuous human interaction, which is time consuming, expensive and undesirable. However, it is widely known that data mining and machine learning techniques can minimize or eliminate such human effort. With data mining techniques we can extract a suitable subset from the intrusion datasets and input it to the IDS. Similarly, with machine learning techniques we can detect patterns in the subset, which is small in general, and apply them to the entire dataset with confidence. Several intrusion detection datasets are publicly available; we selected KDD’99 and NSL-KDD and used them to demonstrate the effectiveness of the proposed relevance feature selection approach.

II. LABELED DATASETS

A. KDD’99 Dataset

The Lincoln Labs [5] developed the KDD’99 dataset and made it publicly available to the research community working on intrusion detection problems. This dataset has been used extensively by research conducted over the last decade. The full (100%) dataset is very large and has many feature variables. Hence, to speed up the computation of machine learning techniques, a 10% subset was extracted and adopted as a training dataset in intrusion detection research [6-9].

Four groups of attacks can be found in the KDD’99 dataset: Denial of Service (DOS) attacks, Probing attacks, Remote to Local (R2L) attacks and User to Root (U2R) attacks. Within each group, different types of attack records are present. For example, the DOS attack group contains records of six different attack types, namely back, land, neptune, pod, smurf and teardrop. Among them, the back attack has received significant attention in intrusion detection research, and several relevance feature selection techniques have been developed for it [6-9]. The KDD’99 dataset has 2,203 records of back attacks, and each of these records gives the values of 41 features that describe the traffic characteristics of a computer network under the back attack. Similarly there are 97,277 records of normal traffic, each with 41 feature values associated with the normal traffic characteristics of the computer network.

From previous research that used the 10% KDD’99 dataset, it is clear that, although there are many features, not many of them contribute to the attacks present in the dataset [8]. Therefore the selection of relevant features, to improve the computational efficiency of the IDS, became an important research task. Significant research has been done to select relevant features for the back attack using the KDD’99 dataset, but none of it provides strong results that determine the accurate relevant features. For example, Olusola et al. [8] conducted research on selecting relevant features for the back attack. They used the Rough Set Theory-based Degree of Dependency principles together with the 10% KDD’99 dataset. Their finding is that features 5 and 6 are the relevant features for the back attack. Similarly, Kayacik et al. [9] conducted research to select relevant features using an Information Gain approach and analyzed the 10% KDD’99 dataset. Their finding is that features 10 and 13 contribute significantly to the back attack and thus are its relevant features. These conflicting results are the main motivation for our research.

An important contributor to these conflicting results is the imperfect nature of the KDD’99 dataset: it has many redundant records along with false positives and anomalies. To eliminate some of the problems existing in the KDD’99 dataset, a newly refined dataset called NSL-KDD was developed by [10]. It is a significantly improved version of the original KDD’99 dataset.

B. NSL-KDD Dataset

The NSL-KDD dataset, available at [10], was developed to overcome some of the obvious problems of the KDD’99 dataset. Mainly, the redundancy problem of KDD’99 is reduced significantly in NSL-KDD, and thus the size of the dataset is reduced. This makes the validation and evaluation of learning algorithms much easier and stronger. NSL-KDD has the same four groups of attacks; however, there are only 956 back attack records and 67,343 normal records, a significant reduction in the size of the dataset. As stated in [10], learning algorithms are generally biased towards frequently occurring records, and the elimination of redundant records in NSL-KDD helps to reduce this problem. The NSL-KDD dataset still suffers from several other problems, including the existence of false positives, the need for user interaction, inaccuracy, and computational issues due to non-contributing features.

Several advantages of NSL-KDD have been reported at [10]; among them, the following are relevant to this research:

The removal of redundant records in the training subset of the NSL-KDD dataset helps our Rough Set Theory (RST) [11, 12] based classifiers to be unbiased towards both frequent and infrequent records.

The reduction of the NSL-KDD training dataset, by the removal of duplicate records, makes experiments fast and affordable using the entire NSL-KDD dataset. Therefore it is expected that the RST based feature selection approach will provide consistent results and lead to a suitable evaluation platform.

The non-existence of duplicate records helps the performance of the classifiers to be unbiased across different methods and yields a better intrusion detection rate.


Some statistical information about the NSL-KDD dataset is presented in [10]. We are only interested in the statistical properties of the back attack and the normal traffic with respect to contributing features. The mean and standard deviation of the back attack records for feature 32 are 135.80 and 82.05; similarly, the mean and standard deviation of the normal traffic records for feature 32 are 125.97 and 96.26. The means show that the back attack can be distinguished from normal traffic by its higher magnitude values. However, there are many values for which back attack and normal traffic are not separable, as shown in figure 1 for feature 32 of the NSL-KDD data:

Figure 1. Classification problem of back attack and normal traffic data

In this figure the blue points represent the magnitudes of (a subset of) the normal traffic data and the red points represent the magnitudes of the back attack data. The figure clearly shows the difficulty of distinguishing the back attack from normal traffic and defines a complex classification problem.

III. PROPOSED APPROACH AND RESULTS

In this section we present an RST-based relevance feature selection approach and the results obtained using the NSL-KDD dataset. We adopted the original RST-based approach in [8] and the refined RST-based approach in [4], and we integrated our data cleaning approach to enhance the technique. Suppose A1, A2, …, Am represent m possible attacks in a computer network, which depend on some or all of the network characteristics (features) represented by F1, F2, …, Fn, and N1 represents the normal traffic. Then the proposed approach to select relevant features has the following steps:

Step 1: Attack Selection:

In this step we select an attack that we want to analyze along with normal traffic data. As an example we select the “back attack” and the normal data from the NSL-KDD dataset for our experiment.

Step 2: Initial Cleaning:

In this step the dataset goes through an initial cleaning process to eliminate duplicate records. Since we use the NSL-KDD dataset, which has already been cleaned, this step is not required in our experiment; however, if an original dataset such as KDD’99 is used, this cleaning is important.

Step 3: Feature Selection:

Now we select features based on suspected relevance and previous observations. For example, if we suspect or observe that attack Ai depends on features Fj, Fk and Fl, then we select those features for the experiment. In our experiment with NSL-KDD, we select features 5 (source bytes), 6 (destination bytes) and 32 (destination host count) to determine whether they are the relevant features for the back attack.

Step 4: Further Cleaning:

Because we reduced the number of features from 41 to 3 for further testing, new duplicates appear in the sub-dataset: records that differed only in the discarded features become identical. These duplicates must be removed. Example NSL-KDD records for features 5, 6 and 32 are presented below, after cleaning the duplicates:

TABLE I. AN EXAMPLE OF NSL-KDD DATA WITH FEATURES 5, 6 & 32

Feature 5   Feature 6   Feature 32   Type
232         8153        30           Normal
199         420         255          Normal
287         2251        8            Normal
300         13788       91           Normal
54540       8314        205          Back
54540       8314        26           Back
40636       2920        255          Back

This newly cleaned dataset, with features 5, 6 and 32, has 316 back attack records and 49,629 normal traffic records. We standardized this dataset using the following standard score formula to represent a record with respect to all other records:

F1=(F1-mean(F1))/std(F1); F2=(F2-mean(F2))/std(F2); F3=(F3-mean(F3))/std(F3);

where the matrices F1, F2 and F3 represent the values of features 5, 6, and 32. The corresponding standardized values of the data presented in Table I are given in Table II:

TABLE II. STANDARDIZED SCORES OF THE DATA IN TABLE I

Feature 5   Feature 6   Feature 32   Type
-0.0282     0.0351      -0.9970      Normal
-0.0282     -0.0714     1.3405       Normal
-0.0280     -0.0462     -1.2255      Normal
-0.0280     0.1127      -0.3632      Normal
0.3180      0.3305      0.8434       Back
0.3180      0.3305      -1.3382      Back
-1.4425     -3.2210     1.4527       Back
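The standard score computation (written in MATLAB in the paper) can be sketched in Python as below. The values used here are only the few Table I rows; the paper computes the mean and standard deviation over the full cleaned dataset, so these scores will not match Table II exactly.

```python
import numpy as np

def standardize(x):
    """Standard score: (x - mean(x)) / std(x), as in the paper's formula."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Illustrative values of features 5, 6 and 32 taken from Table I.
f5 = standardize([232, 199, 287, 300, 54540, 54540, 40636])
f6 = standardize([8153, 420, 2251, 13788, 8314, 8314, 2920])
f32 = standardize([30, 255, 8, 91, 205, 26, 255])
```

Each standardized column then has zero mean and unit standard deviation, which puts the three features on a common scale before the later transformation and cleaning steps.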



Step 5: Data Visualization:

In this step a subset of NSL-KDD is plotted in a 3D space so that the user can decide whether to proceed with processing the data. If the user sees that further processing of these features is not warranted, a new set of three features may be selected. The plots of normal and back attack standard scores for features 5, 6 and 32 are given in figures 2 and 3.

Figure 2. Standardized score of back attack data

Figure 3. Standardized score of normal traffic data

These figures show that there are significant numbers of outliers in both the back attack and the normal traffic, and therefore a suitable data cleaning approach should be adopted before sampling.

Step 6: Shape Transformation:

The outliers observed in the above graphs could be confused with anomalies in both normal and back attack traffic behavior and could lead to abnormal results in RST-based intrusion detection. One useful approach to detecting anomalies is data approximation, but data approximation is more effective when the data displays symmetric geometric patterns. It has recently been shown that ellipsoid-based data approximation techniques are effective in detecting anomalies [13]. However, the above plots show no symmetric, suitable ellipsoid shapes.

Therefore we suggest the following mathematical (linear) transformation, which incorporates properties of the Gaussian distribution [14], in order to obtain an ellipsoidal boundary for the standardized scores.

T1=G1.*F1+G2.*F2+G3.*F3; T2=G2.*F1+G3.*F2+G1.*F3; T3=G3.*F1+G1.*F2+G2.*F3;

where the matrices F1, F2 and F3 are the values of the feature variables in consideration, and the matrices G1, G2 and G3 contain pseudo-random numbers that follow normal (Gaussian) distributions. An example of the Gaussian values used for these variables is given in figure 4:

Figure 4. An example of Gaussian data used

This transformation changes the plot of features 5, 6 and 32 above as shown in figure 5. The main assumption is that the effect of outliers on the Gaussian distribution is noticeably different from that of normal data.

Figure 5. Transformed “back attack” standard scores
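A Python sketch of this transformation (the paper's code is MATLAB, where the elementwise `.*` products become `*` on NumPy arrays). The standardized columns here are random stand-ins for features 5, 6 and 32, not the NSL-KDD values:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; the paper notes seed choice as an open issue
n = 316  # number of cleaned back attack records (Step 4)

# Stand-ins for the standardized feature columns F1, F2, F3.
F1, F2, F3 = rng.standard_normal((3, n))

# Gaussian weight vectors G1, G2, G3, one value per record.
G1, G2, G3 = rng.standard_normal((3, n))

# Cyclic elementwise combination producing the transformed coordinates.
T1 = G1 * F1 + G2 * F2 + G3 * F3
T2 = G2 * F1 + G3 * F2 + G1 * F3
T3 = G3 * F1 + G1 * F2 + G2 * F3
```

Each transformed coordinate mixes all three features with Gaussian weights, which is what pushes the bulk of the data towards an ellipsoidal cloud.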



Step 7: Further Cleaning:

It is now easier to eliminate the anomalies using an ellipsoid-based data approximation technique. To eliminate outliers, the Gaussian distribution properties and the Singular Value Decomposition technique are used. The result of an ellipsoid-based approach is shown in figure 6.

Figure 6. Elliptical boundary for “back attack” standard score
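The ellipsoid-based cleaning can be sketched as follows. This is an assumed implementation, using the SVD of the centered data to define the ellipsoid axes and a roughly 95% Gaussian cut-off; it illustrates the idea rather than reproducing the paper's exact procedure:

```python
import numpy as np

def inside_ellipsoid(X, radius=2.8):
    """Keep the rows of X that fall inside a covariance-aligned ellipsoid.

    The principal axes come from the SVD of the centered data; a point is
    kept when its whitened (Mahalanobis-like) distance is below `radius`.
    radius = 2.8 is roughly the 95% level for 3-D Gaussian data (the square
    root of the chi-square 0.95 quantile with 3 degrees of freedom).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Coordinates along the principal axes, scaled by each axis' spread.
    scores = (Xc @ Vt.T) / (s / np.sqrt(len(X) - 1))
    return X[np.linalg.norm(scores, axis=1) <= radius]
```

Points outside the ellipsoid are treated as outliers and dropped before sampling, which is the role of this cleaning step in the pipeline.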

Step 8: Random Sampling:

The purpose of the random sampling is to select equal numbers of back attack and normal records, so that the influence of the features is assessed without bias towards either record type. The random sampling is applied to the standard scores of the records falling inside the ellipse. These scores are then discretized, and the discrete data is used with different sample sizes in the RST-based approach to select relevant features. The sample sizes used are:

10 back attack and 10 normal
20 back attack and 20 normal
30 back attack and 30 normal
50 back attack and 50 normal
100 back attack and 100 normal
150 back attack and 150 normal
200 back attack and 200 normal
279 back attack and 279 normal
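A minimal Python sketch of the balanced sampling; `back` and `normal` stand for hypothetical arrays of cleaned standard-score records:

```python
import numpy as np

def balanced_sample(back, normal, k, seed=0):
    """Draw k back attack records and k normal records without replacement."""
    rng = np.random.default_rng(seed)
    b = back[rng.choice(len(back), size=k, replace=False)]
    n = normal[rng.choice(len(normal), size=k, replace=False)]
    return b, n

# Sample sizes listed above; 279 is the full cleaned back attack count.
sizes = [10, 20, 30, 50, 100, 150, 200, 279]
```

Sampling without replacement keeps each drawn subset free of the duplicates that the earlier cleaning steps removed.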

Step 9: Discretization of Data:

Now we have a thoroughly cleaned dataset and, with 95% confidence, we can use the samples obtained inside the ellipse. These samples are saved in an Information System file and discretized using ROSE2, a rough set explorer [15-17]. The data was discretized in two ways: over the two values 0 and 1 in the first case, and over the three values 0, 1 and 2 in the second case; the process can be extended to more intervals as needed. This discretization gives the decision tables presented in Table III and Table IV. Table III shows 10 samples of back attack and 10 samples of normal traffic with the 2 discrete values 0 and 1, and Table IV shows 10 samples of each with the 3 discrete values 0, 1 and 2.
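The paper performs the discretization with ROSE2; purely as an illustration, an equal-width binning rule (an assumption, not necessarily ROSE2's method) can be sketched as:

```python
import numpy as np

def discretize(x, n_intervals):
    """Map values to integer codes 0 .. n_intervals-1 by equal-width bins."""
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), n_intervals + 1)
    # Compare against the interior edges so the codes start at 0.
    return np.digitize(x, edges[1:-1])
```

Calling `discretize(scores, 2)` yields 0/1 codes in the style of Table III, and `discretize(scores, 3)` yields 0/1/2 codes in the style of Table IV.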

TABLE III. CLASSIFICATION OF OBJECTS OVER 2 INTERVALS

Objects       Feature 5   Feature 6   Feature 32   Decision
1             0           0           0            1
2             0           0           1            1
3-12, 14-20   0           1           1            1
13            0           1           0            2

TABLE IV. CLASSIFICATION OF OBJECTS OVER 3 INTERVALS

Objects            Feature 5   Feature 6   Feature 32   Decision
1                  1           2           1            1
2-6, 8, 9, 11-19   1           1           1            1
7                  0           1           1            1
10                 1           1           0            1
20                 2           1           1            2

Step 10: RST Application:

In this step we apply Rough Set Theory to the discretized dataset to decide whether particular objects contribute to a definite decision of back attack. RST allows us to classify the objects and make the appropriate decision; it is a useful tool in our case because the data we deal with are imperfect. Following RST, we first find the indiscernibility sets, which consist of objects that carry the same information, and then calculate the lower approximation. The lower approximation set gives us the objects that definitely carry the information that allows the most accurate decision. Although RST defines both lower and upper approximation sets, we adopt only the lower approximation. Our goal is to use RST to classify the objects, find the lower approximation, and calculate the degree of dependency to decide which features are relevant to the back attack.
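The indiscernibility and lower-approximation computation can be sketched in Python; this follows the standard RST definition (degree of dependency = size of the positive region divided by the number of objects) rather than the paper's MATLAB code:

```python
from collections import defaultdict

def degree_of_dependency(conditions, decisions):
    """RST degree of dependency of the decision attribute on the features.

    Objects with identical condition values form one indiscernibility
    class; a class lies in the lower approximation (positive region) only
    if all of its objects share the same decision. The degree of
    dependency is the fraction of objects in the positive region.
    """
    classes = defaultdict(list)
    for cond, dec in zip(conditions, decisions):
        classes[tuple(cond)].append(dec)
    positive = sum(len(d) for d in classes.values() if len(set(d)) == 1)
    return positive / len(decisions)
```

For example, with hypothetical condition rows [(0,0), (0,0), (0,1), (1,1)] and decisions [1, 2, 1, 1], the class (0,0) is inconsistent, so the degree of dependency is 2/4 = 0.5.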

Step 11: Degree of Dependency:

Incorporating the effect of sample size into the degree of dependency requires modifying the final degree-of-dependency calculation. We previously proposed a change to RST in which the sample size was integrated [4]. The change resulted in the following equation:

DoD = [ Σ_{i=0}^{n} S(i) · DoDs(i) ] / [ Σ_{i=0}^{n} S(i) ]

It is a weighted average of DoDs of all the samples, where the weights are the actual sample sizes.
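In Python the weighted average reads as below; the Table V columns can be plugged in as `sizes` and `dods`:

```python
def weighted_dod(sizes, dods):
    """Sample-size-weighted average of per-sample degrees of dependency."""
    assert len(sizes) == len(dods)
    return sum(s * d for s, d in zip(sizes, dods)) / sum(sizes)

# Sample sizes and the 2-interval DoD column of Table V.
sizes = [10, 20, 30, 50, 100, 150, 200, 279]
dods = [0.95, 0.98, 1.00, 1.00, 0.99, 0.99, 0.99, 0.99]
average = weighted_dod(sizes, dods)
```

The weighting lets the larger (and statistically more reliable) samples dominate the final decision value.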

Step 12: DoD Calculation:

The sample data was then loaded into a program written in MATLAB and the degree of dependency was obtained. Degrees of dependency for various sample sizes were



considered for both data subsets, with discrete values over 2 and 3 intervals. In Table III and Table IV, the objects are the data in the indiscernibility subsets obtained from the approximations, and “Decision” is the decision attribute, with 1 denoting a back attack and 2 denoting normal traffic. As can be seen from Table III, the lower approximation contains 19 objects (1, 2, 3-12 and 14-20) out of the total of 20. Hence the degree of dependency from the sample of 10 back attack and 10 normal records is 0.95. Similarly, the DoD for the 3-interval data in Table IV is also 0.95.

Step 13: Final Decision:

The final decision is based on the weighted average degree of dependency, calculated from the sample size-dependent degrees of dependency introduced previously. The DoDs for the different sample sizes are presented in Table V, and all of these values show very high dependency. The average DoDs calculated from samples of all sizes, over both 2 and 3 intervals of discrete values, are as follows:

2 intervals: Average DoD = 0.9922

3 intervals: Average DoD = 0.9867

The difference between the total sample size and the total number of back attack records is due to the data approximation and cleaning: from the total of 956 back attack records, 279 unique records remained after cleaning for the proposed approach. Rough set theory was applied to this data to obtain the results reported.

TABLE V. DOD VALUES WITH RESPECT TO SAMPLE SIZE

Samples S(i)   DoD over 2 intervals   DoD over 3 intervals
10             0.95                   0.95
20             0.98                   0.95
30             1.00                   0.97
50             1.00                   0.98
100            0.99                   0.99
150            0.99                   0.99
200            0.99                   0.99
279            0.99                   0.99

IV. CONCLUSION AND FUTURE WORK

From the proposed approach, we concluded that features 5, 6 and 32 are relevant for identifying a back attack. However, the cleaned data was obtained by modeling 3 ellipses in 2-dimensional spaces and then selecting the data that fall within all the ellipses to obtain the 3-dimensional data; a 3-dimensional ellipsoid could instead be modeled to obtain the data from the entire subset in one step. Also, the values could be discretized

over multiple intervals to examine the results of the same proposed approach. The method can be applied to identify relevant features for all the attacks in a dataset.

To enhance the analysis, we will capture data from a real network and apply the proposed approach to it, as opposed to applying the approach to data captured earlier from simulated network environments such as the KDD’99 and NSL-KDD datasets. The proposed approach promises better accuracy and less computation time than the intrusion detection systems in use today. Future work could also develop a more robust approach to eliminating outliers in the data than one relying on a random seed value.

REFERENCES

[1] A. S. Ashoor and S. Gore, “Importance of Intrusion Detection System (IDS),” International Journal of Scientific and Engineering Research, vol. 2, no. 1, pp. 1-4, Jan. 2011.
[2] M. Sheikhan and M. S. Rad, “Misuse Detection Based on Feature Selection by Fuzzy Association Rule Mining,” World Applied Sciences Journal, pp. 32-40, 2010.
[3] T. P. Mo and J. H. Wang, “Design and implementation of intrusion detection system,” Lecture Notes in Communication Systems and Information Technology, vol. 100, pp. 303-308, 2011.
[4] S. Suthaharan and K. Vinnakota, “An approach for automatic selection of relevance features in intrusion detection systems,” in Proc. of the 2011 International Conference on Security and Management (SAM’11), pp. 215-219, July 18-21, 2011, Las Vegas, Nevada, USA.
[5] On-line, KDD’99: http://kdd.ics.uci.edu/databases/kddcup99/task.html.
[6] A. Lazarevic, V. Kumar, and J. Srivastava, “Intrusion detection: A survey,” Managing Cyber Threats, pp. 19-78, June 2005.
[7] Y. Li, J. Wang, Z. Tian, T. Lu and C. Young, “Building lightweight intrusion detection system using wrapper-based feature selection mechanisms,” Computers & Security, vol. 28, pp. 466-475, 2009.
[8] A. A. Olusola, A. S. Oladele and D. O. Abosede, “Analysis of KDD ’99 Intrusion Detection Dataset for Selection of Relevance Features,” in Proceedings of the World Congress on Engineering and Computer Science, vol. 1, Oct. 2010.
[9] H. G. Kayacik, A. N. Zincir-Heywood and M. I. Heywood, “Selecting Features for Intrusion Detection: A Feature Relevance Analysis on KDD 99 Intrusion Detection Datasets,” Association for Computing Machinery, 2006.
[10] On-line, NSL-KDD: http://www.iscx.ca/NSL-KDD/.
[11] Z. Pawlak, “Rough Set theory and its applications,” Journal of Telecommunications and Information Technology, pp. 7-10, Mar. 2002.
[12] B. Walczak and D. L. Massart, “Tutorial: Rough set theory,” Chemometrics and Intelligent Laboratory Systems, vol. 47, no. 1, pp. 1-16, 1999.
[13] S. Rajasegarar, J. C. Bezdek, C. Leckie, and M. Palaniswami, “Elliptical anomalies in wireless sensor networks,” ACM Transactions on Sensor Networks, vol. 6, no. 1, pp. 7:1-7:28, 2009.
[14] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[15] B. Predki, R. Slowinski, J. Stefanowski, R. Susmaga and Sz. Wilk, “ROSE - Software Implementation of the Rough Set Theory,” in L. Polkowski and A. Skowron, eds., Rough Sets and Current Trends in Computing, Lecture Notes in Artificial Intelligence, vol. 1424, Springer-Verlag, Berlin, pp. 605-608, 1998.
[16] B. Predki and Sz. Wilk, “Rough Set Based Data Exploration Using ROSE System,” in Z. W. Ras and A. Skowron, eds., Foundations of Intelligent Systems, Lecture Notes in Artificial Intelligence, vol. 1609, Springer-Verlag, Berlin, pp. 172-180, 1999.
[17] On-line, ROSE: http://idss.cs.put.poznan.pl/site/rose.html.