
FLORIDA INSTITUTE OF TECHNOLOGY

USING THE DATA MINING CLASSIFIER SELECTION MODEL

TO IMPROVE NETWORK INTRUSION DETECTION

DATA MINING FOR CYBERSECURITY

FINAL PROJECT

SUBMITTED TO:

DR. LIAM MAYRON

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR

CYB 5675: DATA MINING

BY

CRAIG CANNON

MELBOURNE, FLORIDA, JUNE 15, 2014


USING THE DATA MINING CLASSIFIER SELECTION MODEL

TO IMPROVE NETWORK INTRUSION DETECTION

INTRODUCTION:

These days massive amounts of data are being collected and stored in databases all over the world, a trend that has been underway for decades. Databases are reaching sizes in the terabyte range and could be approaching the petabyte range within a few years. Inside these enormous data warehouses lie hidden invaluable information and knowledge that can only be extracted using creative data mining algorithms. These algorithms allow us to extract what are called nuggets of knowledge from large sets of data; basically, they allow us to turn data into information. They were originally created for this purpose and have performed it quite well. Over the years, however, they have also proven even more useful in the area of network intrusion detection.

In the last few decades cyber attacks have been on the rise worldwide. Security breaches have plagued governments, individuals, and corporations for some time now, and the trend shows no end in sight. To defend against these attacks, systems such as firewalls, endpoint security servers, intrusion prevention systems, and intrusion detection systems have been deployed on many networks within the global internet infrastructure. Of all of these defenses, network intrusion detection systems that use data mining techniques are proving to be among the most promising. When combined with data mining techniques, network intrusion detection systems can analyze massive amounts of audit data while at the same time improving detection efficiency.

In the past few years researchers have experimented extensively with various algorithms for network intrusion detection systems. Different approaches have been tried in an effort to improve both the effectiveness and efficiency of NIDS. As a result, detection has improved tremendously over the years and continues to be a major weapon in the cyber war arsenal. For this reason, choosing the best data mining classification method for network intrusion detection has become even more vital for future NIDS designs.

Many NIDS designers have begun to take a closer look at the results of this research. One line of research in particular involves discovering algorithms that work distinctly better on certain types of attacks. Findings in this area could help immensely in enhancing intrusion detection performance. A network intrusion detection system (NIDS) plays the vital role of identifying various types of attacks; its main purpose is to protect the network and its information infrastructure.

The main purpose of a NIDS is to detect intrusions among normal audit data, which is in essence a classification problem. In classification we are given a collection of records known as a training set, where each record has a set of attributes, one of which is called the class. The goal of classification is to assign a class to previously unseen records. To do this, the data is divided into a training set and a test set: the training set is used to build the model and the test set is used to validate it. (Kumar, 2002) Below is a typical framework of a classification model. This framework gives a big-picture overview of the experiment being


undertaken using the KDD99 dataset. Two random samples of the data will be taken: one will be used as the training set and the other as the test set. See Figure 1 below.

Figure 1
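As a rough sketch of this framework, the listing below uses the WEKA Java API (the same toolkit used later in this project) to build a classifier on a training set and validate it on a separate test set. The ARFF file names and the choice of J48 as the example classifier are assumptions made purely for illustration; they are not part of the original experiment, which was carried out through the WEKA GUI.

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch of the train/validate framework shown in Figure 1.
public class ClassificationFramework {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("kdd99_train_10pct.arff").getDataSet(); // hypothetical file
        Instances test  = new DataSource("kdd99_test_3pct.arff").getDataSet();   // hypothetical file

        // The last attribute of each KDD99 record is the class label.
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Build the model on the training set ...
        Classifier model = new J48();
        model.buildClassifier(train);

        // ... and validate it on the unseen test set.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(model, test);
        System.out.println(eval.toSummaryString("\n=== Test set results ===\n", false));
        System.out.println(eval.toClassDetailsString());
    }
}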

The purpose of this project is to evaluate a wide-ranging set of classifier algorithms against the KDD dataset in order to determine whether particular algorithms demonstrate better performance in a particular attack category. A comparison of the algorithms will be completed and their performance results will be matched to specific attack categories. Detecting attacks in the Probe, DOS, U2R, and R2L categories will serve as the baseline foundation of this project; each of these categories involves distinct attack types that are commonly used by attackers when hacking a network. Once we have identified which algorithm best detects which attack category, we will propose a model for designing an ideal network intrusion detection system for future use.

BACKGROUND:

It is a well-known fact that learning classifier models is a critical problem in data mining. Audit-type data from real-world networks is often presented as a set of records with multiple attributes, one of which is a special category attribute called the class. Initially, a training set of records with known class labels is extracted, and the idea is to learn a model for the class in terms of the other attributes. Once this model has been learned, it is used to predict the class of any given set of records, and comparing the predicted classes against the actual classes allows the system to be optimized overall.

Traditionally, the goal has been to reduce the number of misclassified records and thereby maximize accuracy. Many techniques have been attempted over the years to build efficient and effective classifier models. Three methods are reviewed below; in particular, techniques that learn rule-based models are especially popular in data mining.


The first method is the Transductive Confidence Machines for K-Nearest Neighbors (TCM-KNN) technique. TCM-KNN is a commonly used machine learning and data mining algorithm that is highly effective in fraud detection, pattern recognition, and outlier detection. Historically, TCM-KNN was not considered an algorithm fit for intrusion detection, but that attitude has changed immensely over the past decade as research has proven the algorithm's validity. Results have been outstanding: TCM-KNN achieves high detection rates and low false positives even when trained on a limited data set. For instance, an average true positive rate of 99.6% and a false positive rate of 0.1% were reported on a random sampling subset of the KDD dataset. (Nguyen, 2008) TCM-KNN computes confidence based on measures of randomness. Unlike conventional machine learning methods, transduction can offer measures of reliability for individual points, and it requires few assumptions beyond the i.i.d. assumption (that the training points and new, unlabeled points are independently and identically distributed). These properties make TCM-KNN well suited to pattern recognition, fraud detection, and outlier detection. (Yang, 2007)
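To make the transductive idea more concrete, the following self-contained sketch computes the usual TCM-KNN strangeness measure (the sum of the distances to the k nearest same-class neighbors divided by the sum of the distances to the k nearest other-class neighbors) and the resulting p-value for a candidate point. It uses toy arrays and Euclidean distance, and for brevity it does not exclude a training point's zero distance to itself; it illustrates the idea only and is not the implementation evaluated in the cited studies.

import java.util.Arrays;

// Simplified sketch of the TCM-KNN strangeness measure and p-value.
// strangeness(x, y) = (sum of k smallest distances from x to points labeled y)
//                   / (sum of k smallest distances from x to points with other labels)
// p-value of a candidate = fraction of points whose strangeness is at least the candidate's.
public class TcmKnnSketch {

    static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Sum of the k smallest distances from x to the points whose label does (or does not) equal target.
    static double kNearestSum(double[] x, double[][] data, int[] labels, int target, boolean same, int k) {
        double[] d = new double[data.length];
        int n = 0;
        for (int i = 0; i < data.length; i++) {
            if ((labels[i] == target) == same) d[n++] = euclidean(x, data[i]);
        }
        double[] dists = Arrays.copyOf(d, n);
        Arrays.sort(dists);
        double sum = 0;
        for (int i = 0; i < Math.min(k, dists.length); i++) sum += dists[i];
        return sum;
    }

    static double strangeness(double[] x, int label, double[][] data, int[] labels, int k) {
        double sameSum  = kNearestSum(x, data, labels, label, true, k);
        double otherSum = kNearestSum(x, data, labels, label, false, k);
        return otherSum == 0 ? Double.POSITIVE_INFINITY : sameSum / otherSum;
    }

    public static void main(String[] args) {
        // Toy training data: two features per connection, labels 0 = normal, 1 = attack.
        double[][] train = {{0.1, 0.2}, {0.2, 0.1}, {0.15, 0.25}, {0.9, 0.8}, {0.85, 0.9}};
        int[] labels = {0, 0, 0, 1, 1};
        int k = 2;

        double[] candidate = {0.12, 0.18}; // new connection to classify
        int tentativeLabel = 0;            // hypothesis: "normal"

        double aNew = strangeness(candidate, tentativeLabel, train, labels, k);
        int atLeastAsStrange = 0;
        for (int i = 0; i < train.length; i++) {
            if (strangeness(train[i], labels[i], train, labels, k) >= aNew) atLeastAsStrange++;
        }
        double pValue = (atLeastAsStrange + 1.0) / (train.length + 1.0);
        System.out.printf("strangeness = %.4f, p-value = %.4f%n", aNew, pValue);
    }
}

A high p-value for the tentative label suggests the new connection fits that class; a low p-value for every label marks the point as anomalous.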

The next method is the PNrule technique. PNrule was assessed on the KDD test dataset and performed quite well, returning a true positive rate of over 96% against denial of service (DOS) attacks and a false positive rate below 10% overall, proving itself a competent player in the data mining space. (Nguyen, 2008) PNrule is a two-stage, general-to-specific framework based on finding rules that predict the presence of a target class (P-rules) and the absence of a target class (N-rules). The main idea is to learn a set of P-rules that together cover most of the positive examples while each rule still retains significant support. Initially, highly accurate rules are selected, but later accuracy is traded for support. This lessening of accuracy causes some negative examples, or false positives, to be covered. All of the true positives and false positives collectively covered by the union of the P-rules are then combined, and N-rules are learned on this reduced data set to remove the false positives. This two-phase approach is what makes PNrule different. PNrule also uses a mechanism that allows it to avoid overfitting while maintaining its flexibility. (Agarwal, 2000)

The final method is the multiclass Support Vector Machine (SVM) method. The SVM, technically formulated as a binary classifier, is based on the idea of constructing optimal hyperplanes to improve generalization ability. Because most real-world pattern recognition applications are multi-class classification problems, multi-class SVM algorithms are well suited to them. A decomposition-based method divides a multi-class problem into multiple binary problems, constructs multiple two-class SVM classifiers, and combines their classification results. There are three common ways to build multi-class SVMs from binary SVM algorithms: one-vs-all, one-vs-one, and error-correcting output coding. Among the existing decomposition approaches, the one-vs-all strategy has often been regarded as a simple method with somewhat lower accuracy than other multi-class SVMs; however, it is also quite powerful and can produce results that are usually at least as accurate as other methods. (Xu, 2006) When SVMs were evaluated on the KDD99 dataset in the past, the results were promising: false positives were maintained at a very low level, averaging about 0.6%, while true positive rates were roughly 75% for DOS attacks and 82% for probe attacks using a small dataset of 10,000 records. (Nguyen, 2008)
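As a small sketch of the decomposition idea in WEKA, the meta-learner MultiClassClassifier can wrap a binary base learner such as SMO; its default method applies a one-vs-all (1-against-all) decomposition. The file name below is a placeholder and the configuration is illustrative, not the setup used in the cited experiments.

import weka.classifiers.functions.SMO;
import weka.classifiers.meta.MultiClassClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: decompose the multi-class KDD99 problem into binary SVM problems.
public class OneVsAllSvmSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("kdd99_sample.arff").getDataSet(); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        MultiClassClassifier oneVsAll = new MultiClassClassifier(); // default method: 1-against-all
        oneVsAll.setClassifier(new SMO());                          // binary SVM as the base learner
        oneVsAll.buildClassifier(data);
        System.out.println(oneVsAll);
    }
}

Note that WEKA's SMO already handles multi-class data internally using pairwise (one-vs-one) classification, so the wrapper is only needed when an explicit one-vs-all decomposition is wanted.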


SYSTEM DESIGN AND IMPLEMENTATION:

The experimental evaluation consists of three steps. First, the environment is set up with the required hardware and software, the data mining software is selected and installed, and the dataset itself is chosen and preprocessed. Next, five discrete classifier algorithms are chosen from a wide-ranging set of accepted classifier algorithms. These were chosen because they represent a range of well-known, conventional classifier approaches and are already included in the WEKA software. The five classifier algorithms chosen are: Naïve Bayes, J48 (a C4.5 revision), Decision Table, JRip (RIPPER), and SMO. Finally, the experiments are run with each classifier algorithm, the results are recorded, and the performances of the classifiers are compared. Based on the information obtained, an ideal model is suggested.

All experimental iterations were done on the following hardware and software: a Dell Latitude E6410 with an Intel Core i5 2.53 GHz processor and 4 GB of RAM, running the Microsoft Windows 7 operating system. The open-source data mining and machine learning package WEKA, version 3.6 for Windows, was used. WEKA is a data mining workbench that contains a large number of classification algorithms as well as many algorithms for data preprocessing, feature selection, clustering, and finding association rules. It is an all-inclusive workbench, written in Java, and runs on multiple platforms. (Witten, 2011)

The experiment uses the KDD99 dataset, which came about as a direct result of the 1998 DARPA Intrusion Detection Evaluation Program, whose objective was to survey and evaluate research in intrusion detection. This standard set of pre-audited data includes a wide variety of intrusions simulated in a military network environment. Nine weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN were interspersed with multiple types of attacks. The raw training data contained seven weeks of TCP dump network traffic, resulting in about four million connection records. (Stolfo, 1999)

A connection is a series of TCP packets starting and ending at distinct times, between which data flows from a source IP address to a destination IP address. Each connection is labeled as either normal or as an attack, and each attack connection contains exactly one specific attack type. Four main categories of attacks were identified: DOS (denial of service), R2L (remote to local), U2R (user to root), and Probing. (Stolfo, 1999) These are the four categories of attacks used in this classifier experiment's dataset. The actual attack types in the dataset, as well as their categories, can be seen in Figure 2 below.

Figure 2


The KDD99 dataset used here contains one type of normal data and 22 different types of attacks. Each instance of the dataset contains 41 features plus one final class feature. (Nguyen, 2008) In order to make the experiment tractable, we filtered the original KDD99 dataset down to about 10% of the data. This data was randomly chosen and represents a cross-section of attack types. The original dataset had 494,020 records, and we reduced our dataset to 49,402 records. To do this we loaded the original dataset into WEKA, went to Choose > filters > unsupervised > instance, and chose the RemovePercentage filter. We then double-clicked on the name of the filter and entered 90 in the percentage box, which leaves only 10% of the data remaining. Screenshots in Figures 3 and 4 below show the before and after results of the filter being applied to the original dataset. This new dataset will be used as the training set. The next random sampling removes 97% of the data, leaving 3%, or about 14,821 instances, which will be used as the test set.

Figure 3

Figure 4
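The same sampling steps can also be scripted with the WEKA Java API instead of the GUI, as sketched below. Because RemovePercentage removes a contiguous slice of the instances, a Randomize filter is applied first so the kept slice approximates a random sample; the file name and random seed are assumptions for illustration.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.Randomize;
import weka.filters.unsupervised.instance.RemovePercentage;

// Sketch: reproduce the GUI sampling steps (keep 10% for training, 3% for testing).
public class SampleKdd99 {

    static Instances keepPercentage(Instances data, double keep) throws Exception {
        RemovePercentage rp = new RemovePercentage();
        rp.setPercentage(100.0 - keep);   // e.g. remove 90% to keep 10%
        rp.setInputFormat(data);
        return Filter.useFilter(data, rp);
    }

    public static void main(String[] args) throws Exception {
        Instances full = new DataSource("kddcup99_full.arff").getDataSet(); // hypothetical file
        full.setClassIndex(full.numAttributes() - 1);

        // Shuffle first so the contiguous slice behaves like a random sample.
        Randomize rand = new Randomize();
        rand.setRandomSeed(42);
        rand.setInputFormat(full);
        Instances shuffled = Filter.useFilter(full, rand);

        Instances train = keepPercentage(shuffled, 10.0); // roughly 49,402 records
        Instances test  = keepPercentage(shuffled, 3.0);  // roughly 14,821 records
        System.out.println("Training instances: " + train.numInstances());
        System.out.println("Test instances:     " + test.numInstances());
    }
}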


After the filter was applied, the different types of attacks had to be categorized into the four chosen categories: DOS, Probe, U2R, and R2L. This required looking at the total numbers of each attack and putting them in their respective category. This was done manually after WEKA identified each attack type, and an attack-by-category breakdown was formed for each dataset. Afterwards, a distribution-of-connection-type classifier evaluation table was created in Excel for each dataset as well. Figures 5 and 6 below show screenshots of these Excel attack tables, which will be used later for some calculations.

Figure 5

Figure 6
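A scripted alternative to the manual Excel step is sketched below: a lookup table that maps each KDD99 attack label to one of the four categories, which could then be used to tally per-category counts automatically. The label list reflects the 22 attack types commonly listed for the KDD99 data; this helper is illustrative and was not part of the original workflow.

import java.util.HashMap;
import java.util.Map;

// Sketch: map each KDD99 attack label to one of the four attack categories.
public class AttackCategories {
    static final Map<String, String> CATEGORY = new HashMap<>();
    static {
        for (String a : new String[] {"back", "land", "neptune", "pod", "smurf", "teardrop"})
            CATEGORY.put(a, "DOS");
        for (String a : new String[] {"ipsweep", "nmap", "portsweep", "satan"})
            CATEGORY.put(a, "Probe");
        for (String a : new String[] {"buffer_overflow", "loadmodule", "perl", "rootkit"})
            CATEGORY.put(a, "U2R");
        for (String a : new String[] {"ftp_write", "guess_passwd", "imap", "multihop",
                                      "phf", "spy", "warezclient", "warezmaster"})
            CATEGORY.put(a, "R2L");
    }

    // Returns DOS, Probe, U2R, R2L, or "normal" for any label not in the table.
    static String categoryOf(String label) {
        return CATEGORY.getOrDefault(label.replace(".", ""), "normal"); // raw KDD99 labels end with '.'
    }

    public static void main(String[] args) {
        System.out.println(categoryOf("smurf."));       // DOS
        System.out.println(categoryOf("guess_passwd")); // R2L
        System.out.println(categoryOf("normal."));      // normal
    }
}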


Once the training and test sets were created, it was time to select the classifiers and begin training. Cross-validation was chosen as the test option because it avoids overlapping test sets: using one group for testing and the rest for training allows the model-building process to be fine-tuned. The experiment was started with the cross-validation option and continued until all the algorithms were tested. Keeping a consistent testing method across the board is the only way to have a true comparison and to establish a baseline standard before improving the results through further iterations and recommending a model. Once that is done, an accurate estimate can be obtained and the results of each algorithm compared.
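A sketch of this testing procedure with the WEKA API is shown below: each of the five classifiers is evaluated with the same 10-fold cross-validation on the training sample, and the wall-clock time of each run is recorded. The fold count, random seed, and file name are assumptions made for illustration; the actual experiment used the WEKA Explorer GUI.

import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.rules.DecisionTable;
import weka.classifiers.rules.JRip;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: evaluate all five classifiers with an identical cross-validation setup.
public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("kdd99_train_10pct.arff").getDataSet(); // hypothetical file
        train.setClassIndex(train.numAttributes() - 1);

        Classifier[] classifiers = {
            new NaiveBayes(), new J48(), new DecisionTable(), new JRip(), new SMO()
        };

        for (Classifier c : classifiers) {
            long start = System.currentTimeMillis();
            Evaluation eval = new Evaluation(train);
            eval.crossValidateModel(c, train, 10, new Random(1)); // 10-fold cross-validation
            double seconds = (System.currentTimeMillis() - start) / 1000.0;

            System.out.printf("%-14s accuracy %.2f%%, time %.1f s%n",
                    c.getClass().getSimpleName(), eval.pctCorrect(), seconds);
        }
    }
}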

The first classifier chosen was the NaiveBayes algorithm. NaiveBayes is a probabilistic algorithm that learns from the training data the conditional probability of each attribute given the class. It is called naïve because it relies on two simplifying assumptions: that the predictive attributes are conditionally independent given the class, and that no hidden attributes influence the prediction process. (Nguyen, 2008) NaiveBayes classifies by applying Bayes' rule to compute the probability of each class given a particular instance, and then predicts the class with the highest posterior probability. The goal of classification is to correctly forecast the value of a selected discrete class variable given a vector of predictors or attributes. NaiveBayes is very effective on many real-world datasets; nevertheless, its predictive accuracy is sometimes reduced by attribute redundancy, which violates the independence assumption. (Baitharu, 2013) Figures 7 and 8 below show the process of running the NaiveBayes algorithm on the KDD99 training dataset using cross-validation.

Figure 7


Figure 8
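In symbols, the naive independence assumption means the posterior probability of a class can be computed from the per-attribute conditional probabilities of the 41 KDD99 features, and NaiveBayes predicts the class that maximizes this quantity:

% Naive Bayes posterior under the conditional-independence assumption
P(c \mid x_1, \ldots, x_{41}) \propto P(c) \prod_{i=1}^{41} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} P(c) \prod_{i=1}^{41} P(x_i \mid c)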

The second classifier chosen was the J48 algorithm, an optimized Java implementation of C4.5 revision 8. (Nguyen, 2008) J48 can make precise forecasts from data as well as explain the patterns in it. It deals effectively with many issues involving numeric attributes, such as missing values, pruning, estimating error rates, the complexity of decision tree induction, and generating rules from trees. (Baitharu, 2013) In terms of predictive accuracy, J48 performs slightly better than CART. The learning and classification steps of J48 are very fast; however, J48 lacks scalability and efficiency when applied to large datasets. This was clearly shown in the experiment: the system running the software came to a complete halt when attempting to process the entire KDD99 dataset with over 490,000 instances. Nevertheless, on both the test and training datasets J48 performed quite well. The screenshots in Figures 9 and 10 below show the process of running the J48 algorithm on the KDD99 dataset using cross-validation.

Figure 9


Figure 10
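The pruning behavior mentioned above is exposed through a couple of J48 parameters; the short sketch below sets them through the API using WEKA's default values, which are assumptions here rather than values tuned for this experiment.

import weka.classifiers.trees.J48;

// Sketch: J48 settings that govern pruning and tree size (default values shown).
public class J48Settings {
    public static void main(String[] args) {
        J48 j48 = new J48();
        j48.setUnpruned(false);         // keep pruning enabled
        j48.setConfidenceFactor(0.25f); // smaller values prune more aggressively
        j48.setMinNumObj(2);            // minimum number of instances per leaf
        System.out.println("J48 options: " + String.join(" ", j48.getOptions()));
    }
}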

The third classifier chosen was the Decision Table algorithm, which builds a simple decision table majority classifier. It summarizes the dataset with a decision table that contains the same attributes as the original dataset. A new data item is then assigned to a category by finding the line in the decision table that matches the non-class values of the data item. Decision Table uses the wrapper method to find a good subset of attributes for inclusion in the table; by eliminating attributes that contribute little or nothing to a model of the dataset, the algorithm creates a smaller, condensed decision table. (Kohavi, 1995) The screenshots in Figures 11 and 12 below show the process of running the Decision Table algorithm on the KDD99 dataset using cross-validation.

Figure 11


Figure 12

The fourth classifier chosen was the JRip algorithm, which implements the propositional rule learner Repeated Incremental Pruning to Produce Error Reduction (RIPPER). RIPPER builds a rule set by repeatedly adding rules to an empty rule set until all positive examples are covered. Rules are formed by greedily adding conditions to the antecedent of a rule until no negative examples are covered. JRip proceeds by treating all the examples of a particular class in the training data as the target class and finding a set of rules that covers all the members of that class; it then proceeds to the next class and does the same, repeating this until all classes have been covered. (Rajput, 2000) The RIPPER algorithm assembles a rule set in which all positive examples are covered, and it performs efficiently on noisy datasets. Before building a rule, the current set of training examples is partitioned into two subsets, a growing set and a pruning set. The rule is created from examples in the growing set, and rules are added incrementally to the rule set until no negative examples are covered. To prune a rule, RIPPER chooses the deletion that maximizes the pruning evaluation function. (Witten, 2005)
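In WEKA's JRip, the growing/pruning split described above is controlled by a folds parameter (one fold is held out for pruning) along with the number of optimization passes; the sketch below shows these settings with their default values, purely as an illustration rather than a tuned configuration.

import weka.classifiers.rules.JRip;

// Sketch: JRip parameters corresponding to RIPPER's grow/prune cycle (default values shown).
public class JRipSettings {
    public static void main(String[] args) {
        JRip jrip = new JRip();
        jrip.setFolds(3);         // data split into 3 folds; one fold is used for pruning
        jrip.setOptimizations(2); // number of optimization passes over the learned rule set
        jrip.setMinNo(2.0);       // minimum total weight of the instances covered by a rule
        System.out.println("JRip options: " + String.join(" ", jrip.getOptions()));
    }
}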

The fifth and final classifier is the SMO algorithm, WEKA's implementation of a support vector machine trained with sequential minimal optimization. SMO is a supervised classifier that can be either linear or non-linear and uses two datasets: a training set and a test set. In the best case the classes are linearly separable, and many lines split the two classes perfectly; the task is to find the best one, the "separating line." The best line is found by maximizing the distance to the nearest points of both classes in the training set. The complexity of the model has to be balanced between fitting the training data and remaining general, which can be achieved by allowing the model to make errors: SMO tolerates some misclassifications to avoid over-fitting while minimizing the number of errors that are made. (Baitharu, 2013)
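The linear versus non-linear distinction and the tolerance for training errors correspond to SMO's kernel and complexity parameters in WEKA. In the sketch below, a polynomial kernel with exponent 1 (the default) behaves as a linear SVM, while an RBF kernel gives a non-linear one; the parameter values are illustrative assumptions, not the settings used in the experiment.

import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.classifiers.functions.supportVector.RBFKernel;

// Sketch: configuring SMO as a linear or a non-linear SVM.
public class SmoSettings {
    public static void main(String[] args) {
        SMO linear = new SMO();
        linear.setKernel(new PolyKernel()); // exponent 1 by default, i.e. a linear decision boundary
        linear.setC(1.0);                   // complexity constant: trades margin size against training errors

        SMO nonLinear = new SMO();
        nonLinear.setKernel(new RBFKernel()); // radial basis function kernel, i.e. a non-linear boundary
        nonLinear.setC(1.0);

        System.out.println("Linear SMO: " + String.join(" ", linear.getOptions()));
        System.out.println("RBF SMO:    " + String.join(" ", nonLinear.getOptions()));
    }
}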

DISCUSSION:

The five chosen algorithms were assessed on the KDD dataset; Figure 13 gives the results of our findings. Both the true positive (TP) and false positive (FP) rates of each algorithm were recorded, and these are the factors used to identify the best algorithm for a particular category. The total time to build each model was also recorded in order to gain some insight into whether these models are actually practical under real-world time constraints. I must say I was somewhat surprised at the results. The J48 classifier had both the highest true positive rate and the lowest false positive rate for the DOS category. This in essence indicates that certain algorithms do indeed work


better for certain attack categories, which supports the findings of other researchers such as Nguyen and Choi. Another important insight was the long total time taken to build the model by the JRip algorithm: JRip took 914.82 seconds, while NaiveBayes took only 10.55 seconds. The Decision Table algorithm had the second highest total time, taking 642.55 seconds to build the model. Decision Table also had only a 0.70, or 70%, true positive rate for DOS, indicating it would not be the best algorithm for finding denial of service attacks. Another interesting point is that there were not many U2R and R2L attack records included in either of my datasets, which made it very difficult to identify which algorithm would be best for these attacks. As shown in Figure 13 below, most of the rates for these attack categories are zero. In essence this experiment was a great project, and crunching the numbers by building spreadsheets helped me make more sense of what was actually happening. Based on all of the findings, I feel that the J48 classifier is best for DOS attacks while Decision Table is best for Probe attacks, each having the highest true positive rate as well as the lowest false positive rate in its respective attack category. The U2R and R2L findings were inconclusive due to a lack of data for these attacks. As a result, the proposed model would combine both the J48 and Decision Table classifiers into a NID system to improve its efficiency. If the system is found to work too slowly in the real world, Decision Table would be replaced with the SMO algorithm, which is fairly good at identifying Probe attacks in less time than Decision Table. In essence, an IDS with these two algorithms working in parallel would be a well-designed system.

Figure 13
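The per-class numbers behind a table like Figure 13 can be pulled directly from WEKA's Evaluation object, as sketched below. The class values here are the individual attack labels, which would then be rolled up into the four categories using a mapping like the one shown earlier; the file name and fold settings are again assumptions for illustration.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: extract per-class TP/FP rates and the model-building time from a WEKA evaluation.
public class PerClassRates {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("kdd99_train_10pct.arff").getDataSet(); // hypothetical file
        train.setClassIndex(train.numAttributes() - 1);

        long start = System.currentTimeMillis();
        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(new J48(), train, 10, new Random(1));
        double seconds = (System.currentTimeMillis() - start) / 1000.0;

        System.out.printf("Total time: %.2f s%n", seconds);
        for (int i = 0; i < train.classAttribute().numValues(); i++) {
            System.out.printf("%-16s TP rate %.3f  FP rate %.3f%n",
                    train.classAttribute().value(i),
                    eval.truePositiveRate(i),
                    eval.falsePositiveRate(i));
        }
    }
}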


CONCLUSION:

In this final report I have attempted to provide an understanding of research on data mining classifier selection models and their impact on intrusion detection systems. I have also tried to give some introductory understanding of the different classifier algorithms and techniques used to build efficient intrusion detection models. The WEKA data mining workbench was used as the tool of choice to demonstrate how these models would actually be designed and how they would work in identifying certain types of attacks. Background information was given describing different classifier methodologies and citing how researchers are progressing in this area of data mining. A system design and implementation methodology was presented outlining the project I chose in order to give an understanding of some of the current research being done today at the intersection of data mining and cyber security. Overall, in my opinion the research in this project clearly demonstrated that a correlation exists between specific data mining algorithms and specific attack types. Understanding these relationships, as well as other factors such as total time and average accuracy, can prove invaluable in future designs of intrusion detection systems. Nevertheless, there is much more research that can be done in this area. From my findings I have learned that identifying the relationship between certain algorithms and certain attack types is an important topic for both future IDS designs and future security risk mitigation efforts. Having a clearer understanding of this topic has made me more aware of both the intricacies and complexities of data mining algorithms when used in cyber security applications. As ubiquitous computing continues to change the technology infrastructure landscape, one thing is for sure: data mining techniques used in cyber security technologies will continue to gain importance in the ongoing cyber war for many years to come.

ACKNOWLEDGEMENTS:

This project has been one of the most challenging research projects I have ever done, yet it has also been one of the most rewarding. I would like to thank my professor, Dr. Liam Mayron, whose encouragement, guidance, and support from beginning to end motivated me to do my very best and never accept anything less. His wisdom and insight have not only allowed me to develop a better understanding of this subject but have also taught me to never give up, and for that I am eternally grateful.


REFERENCES

Agarwal, R., Joshi, M. (2000). PNrule: A New Framework for Learning Classifier Models in Data Mining (A Case-Study in Network Intrusion Detection). Retrieved from: http://www.siam.org/meetings/sdm01/pdf/sdm01_30.pdf

Baitharu, T., Pani, S. (2013). A Survey on Application of Machine Learning Algorithms on Data Mining. International Journal of Innovative Technology and Exploring Engineering. Retrieved from: http://www.ijitee.org/attachments/File/v3i7/G1363123713.pdf

Kohavi, R. (1996). Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid. KDD-96 Proceedings. Retrieved from: http://www.aaai.org/Papers/KDD/1996/KDD96-033.pdf

Kumar, V. (2002). Data Mining Algorithms. Tutorial presented at the IPAM 2002 Workshop on Mathematical Challenges in Scientific Data Mining. Retrieved from: http://www.ipam.ucla.edu/publications/sdm2002/sdm2002_4007.pdf

Nguyen, H., Choi, D. (2008). Application of Data Mining to Network Intrusion Detection: Classifier Selection Model. Retrieved from: http://arxiv.org/ftp/arxiv/papers/1007/1007.1268.pdf

Rajput, A., Aharwal, R. (2000). J48 and JRIP Rules for E-Governance Data. Retrieved from: http://www.cscjournals.org/csc/manuscript/Journals/IJCSS/volume5/Issue2/IJCSS-448.pdf

Stolfo, S., Fan, W., Lee, W., Prodromidis, A. (1999). Cost-based Modeling and Evaluation for Data Mining With Application to Fraud and Intrusion Detection: Results from the JAM Project. Retrieved from: https://kdd.ics.uci.edu/databases/kddcup99/task.html

Witten, I. H., Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. Morgan Kaufmann, San Francisco.

Xu, X. (2006). Adaptive Intrusion Detection Based on Machine Learning: Feature Extraction, Classifier Construction and Sequential Pattern Prediction. Retrieved from: http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=8CD0CFADEE68079885E5FD4F71BC4F7E?doi=10.1.1.107.9575&rep=rep1&type=pdf

Yang, L., Guo, L. (2007). An Active Learning Based TCM-KNN Algorithm for Supervised Network Intrusion Detection. Retrieved from: http://sourcedb.ict.cas.cn/cn/ictthesis/200907/P020090722621827712247.pdf