
[IEEE 2012 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC) - Coimbatore, India (2012.12.18-2012.12.20)] 2012 IEEE International Conference


Improving Network Security Using Machine Learning Techniques

Shaik Akbar1, Dr. J.A. Chandulal2, Dr. K. Nageswara Rao3, G. Sudheer Kumar4

1 Department of Computer Science and Engineering, SVIET, Nandamuru, Andhra Pradesh, India
2 Department of Computer Science and Engineering, GITAM University, Visakhapatnam, Andhra Pradesh, India
3 Department of Computer Science and Engineering, P.V.P.S.I. Technology, Vijayawada, Andhra Pradesh, India
4 Department of Computer Science and Engineering, SVIET, Nandamuru, Andhra Pradesh, India

([email protected], [email protected], [email protected], [email protected])

Abstract - Detecting malicious connections in computer networks is a growing problem that has motivated extensive research into improved intrusion detection systems (IDS). In this paper, we present a machine learning approach that uses the Decision Tree (C4.5) algorithm and a Genetic Algorithm (GA) to classify such risky or attack-type connections. The algorithms take into consideration different features of network connections to create a classification rule set. Every rule in the rule set identifies a particular attack type. For this research, we implemented a GA and C4.5 and trained them on the KDD Cup 99 data set to create a rule set that can be applied to the IDS to identify and classify different types of attack connections. During our study, we developed a rule set consisting of six rules to classify six different attack types of connections that fall into four classes, namely DoS, U2R, R2L (remote to local) and probing attacks. The generated rules work with 93.70% accuracy for detecting denial of service attack connections, and with significant accuracy for detecting remote to local, user to root and probe connections. The results of our experiments are promising for applying the enhanced genetic algorithm to network intrusion detection systems (NIDS).

Keywords - Intrusion Detection, Genetic Algorithm, C4.5 Algorithm, KDD Cup 99, Computer Networks, Data Mining

I. INTRODUCTION

In the past few years, the information revolution has finally come of age. More than ever before, we see that the Internet has changed our lives. The possibilities and opportunities are limitless; unfortunately, so too are the risks and chances of malicious intrusions. Intruders can be divided into two categories: outsiders and insiders. Outsiders are intruders who approach your system from outside your network and who may attack your external presence (e.g., deface web servers, forward spam through e-mail servers, etc.). They may also attempt to go around the firewall and attack machines on the internal network. Insiders, in contrast, are legitimate users of your internal network who misuse privileges, impersonate higher-privileged users, or use proprietary information to gain access from external sources, Axelsson 2000 [1].

Intrusion detection systems have emerged to detect actions which endanger the integrity, confidentiality or availability of a resource, as an effort to provide a solution to existing security issues. Intrusion detection is the process of identifying actions that attempt to compromise the confidentiality, integrity or availability of a resource, Kruegel C et al. 2002 [2]. Intrusion is the act of violating the security policy or legal protections that apply to an information system. An intruder is somebody attempting to break into or misuse your system, Kayacik et al. 2003 [3]. Stateless firewalls suffer from several significant drawbacks that make them insufficient to protect networks by themselves.

There are two general approaches that intrusion detection technologies use to identify attacks: anomaly detection and misuse detection. Anomaly detection identifies activities that differ from established patterns for users, or groups of users. It typically involves the creation of knowledge bases that contain the profiles of the monitored activities. The second general approach to intrusion detection is misuse detection. This approach involves comparing a user's activities with the known behaviours of attackers attempting to penetrate a system, Axelsson 2000 [1]. While anomaly detection typically uses threshold monitoring to indicate when a certain established metric has been reached, misuse detection frequently uses a rule-based approach. When applied to misuse detection, the rules become scenarios for network attacks. The intrusion detection mechanism identifies a potential attack if a user's activities are found to be consistent with the established rules, Bass T. 2000 [4].

The KDD Cup 99 data set is used for intrusion detection, and the developed model is tested on this data set. Applying artificial intelligence to the detection of intrusions is a way to build an accurate IDS. Misuse and anomaly detection are performed, and key patterns are identified, using rule-based, Genetic Algorithm and C4.5 algorithm techniques.

978-1-4673-1344-5/12/$31.00 ©2012 IEEE

II. ATTRIBUTE SELECTION

The information gain measure used in the C4.5 algorithm is used to select the test attribute at every node in the tree. Such a measure is referred to as an attribute selection measure or a measure of the goodness of split. The attribute with the highest information gain is chosen as the test attribute for the current node. This attribute minimizes the information needed to classify the samples in the resulting partitions. Such an information-theoretic approach minimizes the expected number of tests needed to classify an object and guarantees that a simple tree is produced.

III. EXISTING ALGORITHM: INFORMATION GAIN

Let S be a set of training samples with their corresponding labels. Suppose there are m classes, the training set contains $S_i$ samples of class $i$, and $S$ is the total number of samples in the training set. The expected information needed to classify a given sample is calculated by:

$$I(S_1, S_2, \ldots, S_m) = -\sum_{i=1}^{m} \frac{S_i}{S} \log_2 \frac{S_i}{S} \tag{1}$$

A feature F with values $\{f_1, f_2, \ldots, f_v\}$ can split the training set into v subsets, one for each value $f_j$. In addition, let the j-th subset contain $S_{ij}$ samples of class $i$. The entropy of the feature F is

$$E(F) = \sum_{j=1}^{v} \frac{S_{1j} + \cdots + S_{mj}}{S} \, I(S_{1j}, S_{2j}, \ldots, S_{mj}) \tag{2}$$

The information gain for F can be calculated as:

$$\mathrm{Gain}(F) = I(S_1, S_2, \ldots, S_m) - E(F) \tag{3}$$

In this study, information gain is calculated for class labels by using a binary discrimination for each class. That is, for each class, a data set instance is considered in-class if it has the same label, and out-class if it has a different label. Consequently, as opposed to calculating one information gain as a general measure of the relevance of the attribute for all classes, we calculate an information gain for each class. This indicates how well the attribute can discriminate the given class from the other classes.
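As an illustration of equations (1) to (3) and of the per-class (in-class versus out-class) gain described above, the following is a minimal Python sketch. The function names and the toy `protocol`/`label` data are assumptions made for this example; they are not taken from the KDD Cup 99 data set or from the authors' implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Expected information I(S1..Sm) = -sum (Si/S) * log2(Si/S), as in Eq. (1)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Gain(F) = I(S) - E(F), where E(F) is the size-weighted entropy of each subset."""
    total = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    expected = sum(len(s) / total * entropy(s) for s in subsets.values())  # Eq. (2)
    return entropy(labels) - expected                                      # Eq. (3)

def per_class_gain(values, labels):
    """One gain per class: each instance is relabelled in-class (True) or out-class (False)."""
    return {c: information_gain(values, [y == c for y in labels])
            for c in sorted(set(labels))}

# Toy data, hypothetical and for illustration only.
protocol = ["tcp", "tcp", "icmp", "icmp", "udp", "udp"]
label = ["normal", "normal", "dos", "dos", "probe", "normal"]

overall_gain = information_gain(protocol, label)
class_gains = per_class_gain(protocol, label)
```

In this toy example the `protocol` attribute separates the `dos` records perfectly, so the per-class gain for `dos` equals the full entropy of the in-class/out-class labelling, while the gain for `probe` is lower.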

IV. PROPOSED ENHANCEMENT: GAIN RATIO CRITERION

The notion of information gain introduced earlier tends to favor attributes that have a large number of values. For example, if we have an attribute D that has a distinct value for each record, then Info(D, T) is 0, and thus Gain(D, T) is maximal. To compensate for this, it was suggested in [6] to use the following ratio instead of gain. Split information is the information due to the split of T on the basis of the value of the categorical attribute D, and is defined by

$$\mathrm{SplitInfo}(D, T) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|} \tag{4}$$

where $T_1, \ldots, T_n$ are the subsets of T induced by the n values of D. The gain ratio is then calculated by

$$\mathrm{GainRatio}(D, T) = \frac{\mathrm{Gain}(D, T)}{\mathrm{SplitInfo}(D, T)} \tag{5}$$

The gain ratio expresses the proportion of useful information generated by the split, i.e., the part that appears helpful for classification. If the split is near-trivial, split information will be small and this ratio will be unstable. To avoid this, the gain ratio criterion selects a test to maximize the ratio above, subject to the constraint that the information gain must be large, at least as large as the average gain over all tests examined.
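The gain ratio criterion with the average-gain constraint can be sketched in Python as follows. This is only an illustrative sketch: the helper names, the `choose_attribute` function and the toy columns (`record_id`, `flag`, `service`) are assumptions for the example, not the authors' code.

```python
import math
from collections import Counter

def entropy(items):
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())

def gain(values, labels):
    """Information gain of splitting `labels` by the attribute `values`."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return entropy(labels) - sum(len(g) / total * entropy(g) for g in groups.values())

def split_info(values):
    """SplitInfo(D, T): entropy of the partition sizes |Ti| / |T|, as in Eq. (4)."""
    return entropy(values)

def choose_attribute(columns, labels):
    """Maximize gain ratio among attributes whose gain is at least the average gain."""
    gains = {name: gain(col, labels) for name, col in columns.items()}
    average = sum(gains.values()) / len(gains)
    best, best_ratio = None, -1.0
    for name, col in columns.items():
        info = split_info(col)
        if gains[name] < average or info == 0:
            continue  # skip weak or trivial splits
        ratio = gains[name] / info  # Eq. (5)
        if ratio > best_ratio:
            best, best_ratio = name, ratio
    return best, best_ratio

# Toy data, hypothetical: record_id is distinct for every record.
labels = ["dos", "dos", "dos", "normal", "normal", "normal"]
columns = {
    "record_id": ["a", "b", "c", "d", "e", "f"],
    "flag": ["S0", "S0", "S0", "SF", "SF", "SF"],
    "service": ["http", "http", "ftp", "http", "ftp", "ftp"],
}
best_name, best_ratio = choose_attribute(columns, labels)
```

Here `record_id` and `flag` both reach the maximal gain, but the split information of `record_id` is log2(6), so its gain ratio is much lower and `flag` is selected. This is the bias toward many-valued attributes that the ratio is meant to correct.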

V. CLASSIFYING AND DETECTING ANOMALIES

Misuse detection is done by applying rules to the test data. The test data is collected from the KDD Cup data set and stored in a database. The rules are applied as SQL queries to the database. The data is classified under different attack categories as follows:

1) DoS
2) Probe
3) U2R
4) R2L

The C4.5 algorithm generates a decision tree, starting from the root node, by choosing the remaining attribute with the highest information gain as the test for the current node. In this work, Enhanced C4.5, which chooses the remaining attribute with the highest information gain ratio as the test for the current node and is considered a later version of the C4.5 algorithm, is used to build the decision trees for classification. From Table 1 it is clear that Enhanced C4.5 outperforms the classical C4.5 algorithm. Split information is the information due to the split of T on the basis of the value of the categorical attribute D, which is defined by

$$\mathrm{SplitInfo}(D, T) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|} \tag{4}$$


and the gain ratio is then calculated by

$$\mathrm{GainRatio}(D, T) = \frac{\mathrm{Gain}(D, T)}{\mathrm{SplitInfo}(D, T)} \tag{5}$$

In Enhanced C4.5, the gain ratio expresses the proportion of useful information generated by the split, i.e., the part that appears helpful for classification. If the split is near-trivial, split information will be small and this ratio will be unstable. To avoid this, the gain ratio criterion selects a test to maximize the ratio above, subject to the constraint that the information gain must be large, at least as large as the average gain over all tests examined.
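The idea of applying rules as SQL queries to test data stored in a database, as described in this section, might look like the following sketch using Python's standard `sqlite3` module. The table name `connections`, the column `conn_count`, the sample rows and the rule thresholds are all hypothetical and chosen for illustration; the paper does not list its actual rules or schema.

```python
import sqlite3

# In-memory database standing in for the stored test data (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE connections ("
    "id INTEGER PRIMARY KEY, protocol_type TEXT, flag TEXT, "
    "conn_count INTEGER, serror_rate REAL)"
)
rows = [
    (1, "tcp", "S0", 250, 1.0),   # many half-open connections
    (2, "tcp", "SF", 3, 0.0),     # ordinary connection
    (3, "icmp", "SF", 400, 0.0),  # many ICMP connections
]
conn.executemany("INSERT INTO connections VALUES (?, ?, ?, ?, ?)", rows)

# Hypothetical rules: one SQL condition per attack category.
RULES = {
    "DoS": "flag = 'S0' AND conn_count > 100 AND serror_rate > 0.8",
    "Probe": "protocol_type = 'icmp' AND conn_count > 300",
}

def classify(connection, rules):
    """Run each rule as a SQL query and collect the ids of the matching records."""
    result = {}
    for category, condition in rules.items():
        cursor = connection.execute(
            f"SELECT id FROM connections WHERE {condition} ORDER BY id"
        )
        result[category] = [row[0] for row in cursor.fetchall()]
    return result

matches = classify(conn, RULES)
```

The rule conditions are interpolated into the query text here only because they are fixed strings defined in the sketch itself; conditions built from untrusted input would need parameter binding instead.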

VI. CONCLUSIONS: OVERALL PERFORMANCE FOR C4.5 ALGORITHM VS ENHANCED C4.5 ALGORITHM

Table 1 shows the overall detection rate and false positive rate for the C4.5 and Enhanced C4.5 algorithms. Enhanced C4.5 gives higher detection rates for the DoS, Probe, R2L and U2R categories compared to the C4.5 algorithm.

TABLE 1

OVERALL DETECTION RATE AND FALSE POSITIVE RATE FOR C4.5 AND ENHANCED C4.5 ALGORITHM

| Sl. No | Attack Category | Detection Rate (%) (C4.5) | Detection Rate (%) (Enhanced C4.5) | False Positive (%) (Enhanced C4.5) |
|---|---|---|---|---|
| 1 | DoS | 90.6 | 92.92 | 0.085 |
| 2 | Probe | 84.0 | 88.29 | 0.152 |
| 3 | U2R | 83.6 | 84.00 | 0.220 |
| 4 | R2L | 53.7 | 66.91 | 0.398 |
| | Average Success Rate | 77.975 | 83.03 | 0.213 |
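The paper does not state how the detection rate and false positive rate are computed. The sketch below assumes the usual definitions (detected attacks over all attacks, and false alarms over all normal records) and reproduces the average detection rates of Table 1 from its per-category values. The function names are assumptions for this example.

```python
def detection_rate(true_positives, false_negatives):
    """Percentage of attack records that were flagged (assumed standard definition)."""
    return 100.0 * true_positives / (true_positives + false_negatives)

def false_positive_rate(false_positives, true_negatives):
    """Percentage of normal records wrongly flagged as attacks (assumed standard definition)."""
    return 100.0 * false_positives / (false_positives + true_negatives)

def mean(values):
    return sum(values) / len(values)

# Per-category detection rates copied from Table 1 (DoS, Probe, U2R, R2L).
c45 = [90.6, 84.0, 83.6, 53.7]
enhanced_c45 = [92.92, 88.29, 84.00, 66.91]

avg_c45 = mean(c45)                  # the "Average Success Rate" row for C4.5
avg_enhanced = mean(enhanced_c45)    # the same row for Enhanced C4.5
improvement = avg_enhanced - avg_c45  # roughly 5 percentage points
```

The averages are unweighted means over the four categories, which is how the "Average Success Rate" row of Table 1 can be reproduced.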

VII. MODEL RESULT SCREEN SHOTS

Fig. 1. Showing KDDCUP Decision Tree Data Set

Fig. 2. Entropy and Gain Ratio Values of All Attributes


VIII. PROPOSED DETECTION GENETIC ALGORITHM OVERVIEW

The following listing gives the main steps of the proposed detection algorithm as well as the training process. It first generates the initial population and loads the network audit data. Then the initial population is evolved for a number of generations. In every generation, the quality of the rules is first calculated, and then a number of best-fit rules are selected. The training process starts by randomly generating an initial population of rules (Step 1). Step 2 calculates the total number of records in the audit data. Step 3 calculates the fitness of every rule and selects the best-fit rules into the new population. Step 4 computes the rank selection of individuals. Steps 5-7 apply the crossover and mutation operators to each rule in the new population. Step 8 selects the top-ranked chromosomes into the new population. Finally, Step 9 checks and decides whether to stop the training process or to enter the next generation to continue the evolution process.

A. Solution Steps of the Detection Algorithm

Algorithm: Rule set generation using Genetic Algorithm
Input: Number of generations, set of binary strings, population size, crossover probability, mutation probability.
Output: A set of selected attributes.

Step 1) Initialize the population randomly
Step 2) Compute the total number of records in the training set
Step 3) Calculate Fitness = f(x) / f(sum), where f(x) is the fitness of individual x and f(sum) is the total fitness of all individuals
Step 4) Rank selection: Ps(i) = r(i) / rsum, where Ps(i) is the probability of selecting individual i, r(i) is the rank of individual i, and rsum is the sum of the ranks of all individuals
Step 5) For every chromosome in the new population
Step 6) Apply the regular crossover operator to the chromosome
Step 7) Apply the mutation operator to the chromosome
Step 8) Select the top 60% of chromosomes into the new population
Step 9) If the number of generations has not been reached, go to Step 3.
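The steps above can be sketched in Python as follows. This is one possible reading of the listing, under stated assumptions: the `fitness` function (agreement with a target bit mask) is a hypothetical stand-in for rule quality on audit data, crossover is single-point, and the top 60% of each generation is carried over unchanged while the remaining places are filled with offspring. None of these details are specified in the paper.

```python
import random

def fitness(chromosome, target):
    # Hypothetical stand-in for rule quality: number of bits agreeing with a target mask.
    return sum(1 for a, b in zip(chromosome, target) if a == b)

def rank_probabilities(scores):
    """Step 4: Ps(i) = r(i) / rsum, rank 1 = worst, rank N = best (ties broken by position)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    rank_sum = sum(ranks)
    return [r / rank_sum for r in ranks]

def crossover(a, b, rng):
    # Single-point crossover (an assumption; the listing only names a crossover operator).
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(chromosome, rate, rng):
    # Flip each bit independently with probability `rate`.
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]

def evolve(target, pop_size=20, generations=30, p_cross=0.8, p_mut=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in target] for _ in range(pop_size)]  # Step 1
    history = []  # best raw fitness seen at the start of each generation
    for _ in range(generations):  # Step 9: loop until the generation count is reached
        raw = [fitness(c, target) for c in population]
        total = sum(raw) or 1
        scores = [f / total for f in raw]  # Step 3: f(x) / f(sum)
        history.append(max(raw))
        probs = rank_probabilities(scores)  # Step 4
        ranked = sorted(zip(raw, population), key=lambda pair: pair[0], reverse=True)
        survivors = [c for _, c in ranked[: (pop_size * 6) // 10]]  # Step 8: top 60%
        children = []
        while len(survivors) + len(children) < pop_size:  # Steps 5-7
            a, b = rng.choices(population, weights=probs, k=2)
            child = crossover(a, b, rng) if rng.random() < p_cross else list(a)
            children.append(mutate(child, p_mut, rng))
        population = survivors + children
    best = max(population, key=lambda c: fitness(c, target))
    return best, history

target_mask = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]  # hypothetical 12-bit rule encoding
best_rule, best_per_generation = evolve(target_mask)
```

Because the best 60% of each generation is kept unchanged in this sketch, the best fitness recorded in `best_per_generation` can never decrease from one generation to the next.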

IX. EXPERIMENTAL RESULTS

From the above implementation we have successfully generated rules that classify the stated attack connections by applying the Genetic Algorithm to the selected feature set and computing the fitness value for each generation. This section reports results for the four attack categories that the rules can identify.

TABLE 2

ENHANCED RULE BASED GA - DETECTION RATE FOR DOS, R2L, U2R, PROBE ATTACKS

| Sl. No | Attack Category | Detection Rate (%) | False Positive (%) |
|---|---|---|---|
| 1 | DoS | 93.70 | 0.063 |
| 2 | R2L | 88.85 | 0.112 |
| 3 | U2R | 92.50 | 0.075 |
| 4 | Probe | 95.33 | 0.055 |
| | Average Success Rate | 92.595 | 0.076 |

TABLE 3

OVERALL PERFORMANCE COMPARISONS OF G.A VS ENHANCED G.A.

| Sl. No | Attack Category | Detection Rate (%) (Hoffman) | Detection Rate (%) (Selvakani) | Detection Rate (%) (Enhanced G.A) | False Positive (%) (Enhanced G.A) |
|---|---|---|---|---|---|
| 1 | DoS | 82.9 | 86.7 | 93.70 | 0.063 |
| 2 | Probe | 75.3 | 79.1 | 95.33 | 0.112 |
| 3 | U2R | 73.1 | 71.2 | 92.50 | 0.075 |
| 4 | R2L | 85.3 | 83.3 | 88.85 | 0.055 |
| | Average Success Rate | 79.15 | 80.075 | 92.595 | 0.076 |

The graph in Fig. 3 shows the performance of G.A and Enhanced G.A in terms of accuracy for the DoS, R2L, U2R and Probe categories.

[Bar chart: detection rate (%), 0 to 100, by attack category (DoS, Probe, U2R, R2L) for Hoffman, Selvakani and Enhanced G.A]

Fig. 3. Performance of G.A and Enhanced G.A


TABLE 4

PERFORMANCE COMPARISON OF ENHANCED G.A VS ENHANCED C4.5

The graph in Fig. 4 shows the performance of Enhanced G.A and Enhanced C4.5 in terms of accuracy for the DoS, R2L, U2R and Probe categories.

Fig. 4. Performance of Enhanced G.A and Enhanced C4.5 Algorithm

X. CONCLUSION AND FUTURE WORK

The enhanced Genetic Algorithm is a more suitable method for intrusion detection than the enhanced C4.5 algorithm. Different classification rules for intrusion detection are obtained through the Genetic Algorithm. The proposed Genetic Algorithm provides an intrusion detection system for detecting DoS, R2L, U2R and Probe attacks from the KDD Cup 99 data set. The results of the experiments are satisfactory, with an average success rate of 92.595%, compared with 83.03% for Enhanced C4.5. In future work, we plan to experiment with other features and different classification methods.

REFERENCES

[1] Axelsson S. 2000, Intrusion Detection Systems: A Survey and Taxonomy, Technical Report, Dept. of Computer Engineering, Chalmers University.

[2] Kruegel C. and Valeur F. 2002, Stateful Intrusion Detection for High-Speed Networks, Proceedings of the IEEE Symposium on Research on Security and Privacy, 285-293.

[3] Kayacik G., Zincir-Heywood N., Heywood M. 2003, On the Capability of an SOM-based Intrusion Detection System, Proceedings of International Joint Conference on Neural Networks.

[4] Bass T. 2000, Intrusion detection systems and multisensor data fusion, Communications of the ACM, Vol.43, 99-105.

[5] S. Selvakani K, Rengan S Rajesh, “Integrated Intrusion Detection System Using Soft Computing”, IJNS, Vol.10, No.2, pp.87-92, March 2010.

[6] Bridges S.M and Vaughn R.B, “Fuzzy Data Mining and Genetic Algorithms Applied to Intrusion Detection”, Proceedings of 12th Annual Canadian Information Technology Security Symposium, pp.109-122, 2000.

[7] Crosbie Mark and Gene Spafford 1995, “Applying Genetic Programming to Intrusion Detection”, in Proceedings of 1995 AAAI Fall Symposium on Genetic Programming, pp.1-8, Cambridge, Massachusetts.

[8] Chitur. A, “Model Generation for an Intrusion Detection System using Genetic Algorithms”, High School Honors Thesis, http”//www/cs columibi.edu/ids/publications/gaidsthesis 01.pdf accessed in 2006.

[9] C. Xiang and S.M. Lim, “Design of multiple-level hybrid classifier for intrusion detection system,” in IEEE Transaction on System Man, Cybernetics Part A, Cybernetics, Vol.2, No.28, Mystic, CT, pp. 117-122, May, 2005.

[10] J. Shavlik and M. Shavlik, “Selection, combination, and evaluation of effective software sensors for detecting abnormal computer usage,” Proceedings of the First International Conference on Network Security, Seattle, Washington, USA, pp. 56-67, May 2003.

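TABLE 4

PERFORMANCE COMPARISON OF ENHANCED G.A VS ENHANCED C4.5

| Sl. No | Attack Category | Detection Rate (%) (Enhanced G.A) | False Positive (%) (Enhanced G.A) | Detection Rate (%) (Enhanced C4.5) | False Positive (%) (Enhanced C4.5) |
|---|---|---|---|---|---|
| 1 | DoS | 93.70 | 0.063 | 92.92 | 0.085 |
| 2 | Probe | 95.33 | 0.112 | 88.29 | 0.152 |
| 3 | U2R | 92.50 | 0.075 | 84.00 | 0.220 |
| 4 | R2L | 88.85 | 0.055 | 66.91 | 0.398 |
| | Average Success Rate | 92.595 | 0.076 | 83.03 | 0.213 |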

[Fig. 4 bar chart: detection rate (%), 0 to 100, by attack category (DoS, Probe, U2R, R2L) for Enhanced G.A and Enhanced C4.5]
