Application of Malicious URL Detection In Data Mining


International Journal of Advance Foundation and Research in Computer (IJAFRC)
Volume 2, Issue 7, July - 2015. ISSN 2348 4853
© 2015, IJAFRC. All Rights Reserved. www.ijafrc.org

Application of Malicious URL Detection in Data Mining

Mr. Jadhav Bharat S.*1, Dr. Gumaste S. V.*2

*1 M.E. Student, Department of Computer Engineering, Sharadchandra Pawar College of Engineering, Dumbarwadi, Otur, Pune, Maharashtra, India

*2 Associate Professor and Head, Department of Computer Engineering, Sharadchandra Pawar College of Engineering, Dumbarwadi, Otur, Pune, Maharashtra, India

[email protected]*1, [email protected]*2

ABSTRACT

Many recent computer attacks are launched by luring users to visit a malicious webpage. In this paper we construct a real-time system that uses machine learning techniques to detect malicious URLs (spam, phishing, exploits, and so on). To this end, we examine techniques that classify URLs based on their lexical and host-based features, together with online learning to process large numbers of examples and adapt quickly to evolving URLs over time. However, in a real-world malicious URL detection task, the ratio between the number of malicious URLs and legitimate URLs is highly imbalanced, making it inappropriate to simply optimize prediction accuracy. A further limitation of previous work is the assumption that a large amount of labeled training data is available, which is impractical because human labeling is expensive. A user can be tricked into voluntarily giving away confidential information on a phishing page, or can become the victim of a drive-by download that results in a malware infection. A malicious URL is a link pointing to malware or a phishing site, and it may then propagate through the victim's contact list. Moreover, hackers sometimes use social engineering tricks that make malicious URLs hard to identify. To address these issues, in this paper we present a novel framework of Cost-Sensitive Online Active Learning (CSOAL).

Index Terms: Malicious URL Detection, Cost-Sensitive Learning, Online Learning, Active Learning.

    I. INTRODUCTION

The WWW allows people to access information across the Internet, but it also carries fraudulent content such as fake drugs, malware, and so on. The Web has become a vehicle for criminal enterprises such as spam-advertised commerce (e.g., counterfeit watches or pharmaceuticals), financial fraud (e.g., via phishing), and the propagation of malware (e.g., so-called drive-by downloads) [1][2]. A user accesses all kinds of information (trusted or suspicious) on the Web by clicking on a URL (Uniform Resource Locator) that links to a particular website. It is thus important for Internet users to assess the risk of clicking a URL in order to avoid accessing malicious websites.

Although the exact adversary mechanisms behind web criminal activities may vary, they all try to lure users into visiting malicious websites by clicking a corresponding URL (Uniform Resource Locator) [3]. The motivations behind these schemes may differ; the common element among them is the requirement that unsuspecting users visit their sites.


Visits to these sites can be driven by email, Web search results, or links from other Web pages, but all require the user to take some action, such as clicking, that specifies the desired Uniform Resource Locator (URL). A URL is called malicious (also known as black) if it is created for a malicious purpose and leads a user to a specific threat that may become an attack, such as spyware, malware, or phishing. Malicious URLs are a major risk on the Web, and detecting them is therefore an essential task in network security intelligence. If users could be warned that a particular URL was dangerous before visiting it, the problem could be avoided.

The security community has responded by developing blacklisting services, which tell users whether a particular site is malicious. These blacklists are constructed by extracting features from URLs; based on those features, a classifier can divide URLs into a whitelist and a blacklist. However, many suspicious sites are not blacklisted, either because they were launched recently, were never visited by a user, or were checked incorrectly (e.g., due to cloaking) [4][5][6][7]. To address this problem, some client-side systems analyze the content or behavior of a website as it is visited. But in addition to run-time overhead, these approaches can expose the user to the very browser-based vulnerabilities that we seek to avoid.

In this paper, we focus on a complementary part of the design space: URL classification, that is, classifying the reputation of a website entirely from its URL. The motivation is to provide inherently better coverage than blacklisting-based approaches (e.g., correctly predicting the status of new sites) while avoiding the client-side overhead and risk of approaches that analyze Web content on demand [8][9][10][11]. In particular, we explore the use of statistical methods from machine learning for classifying site reputation based on the relationship between URLs and the lexical and host-based features that characterize them.

    II. PROBLEM STATEMENT

Our main purpose is to treat URL detection as a binary classification problem in which positive examples are malicious URLs and negative examples are benign URLs. This approach can succeed if the distribution of extracted feature values for malicious examples differs from that of benign examples, and if the training set shares the same feature distribution as the testing set.
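As a concrete illustration of this formulation, the sketch below trains an off-the-shelf binary classifier on toy URL feature vectors; the feature values and the choice of scikit-learn's LogisticRegression are our own illustrative assumptions, not the classifier used in the paper.

    # Minimal sketch of the binary classification formulation.
    # The toy feature vectors (URL length, dot count) are hypothetical.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Label +1 = malicious, -1 = benign.
    X = np.array([[72, 5], [18, 1], [95, 7], [22, 2], [60, 4], [15, 1]])
    y = np.array([+1, -1, +1, -1, +1, -1])

    # The formulation assumes train and test share the same feature distribution.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    clf = LogisticRegression().fit(X_train, y_train)
    print(clf.predict(X_test))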

Also, we classify URLs based only on the relationship between URLs and the lexical and host-based features that characterize them; we do not consider two other potentially useful sources of features: the content of the page the URL points to, and the context of the URL [12][13][14][15] (e.g., the page or email in which the URL is embedded).

Although this information could improve classification accuracy, we exclude it for the following reasons.

1. Avoiding downloading page content is safer for users.

2. Classifying a URL with a trained model is a lightweight operation compared to first downloading the page and then using its contents for classification.

3. Concentrating on URL features makes the classifier applicable to any context in which URLs are found (Web pages, email, chat, calendars, games, etc.), rather than dependent on a particular application setting.

4. Reliably obtaining the malicious version of a page for both training and testing can be a difficult practical issue.


Suspicious sites have shown the ability to cloak the content of their Web pages, that is, to show different content to different clients. For example, a suspicious server may send benign versions of a page to honeypot IP addresses that belong to security practitioners, but send malicious versions to other clients [10][15].

    III. URL RESOLUTION

URLs are human-readable text strings. Through a multistep resolution process, browsers translate each URL into instructions that locate the server hosting the site and specify where the site or resource is placed on that host [6][8][10][14]. The standard syntax of a URL is

[protocol]://[hostname][path]

    Fig. 1. Example of a Uniform Resource Locator (URL) and its components.

Components of a URL (a parsing sketch follows this list):

1) Protocol: This portion of the URL indicates which network protocol should be used to fetch the requested resource. The most commonly used protocol is the Hypertext Transfer Protocol, or HTTP (http). In Figure 1, all of the example URLs specify the HTTP protocol [10].

2) Path: This portion of a URL is analogous to the path name of a file on a local computer. In Figure 1, the example path is /~jtma/url/example.html. The path tokens, delimited by punctuation such as slashes, dots, and dashes, show how the site is organized [11].

3) Hostname: This portion of the URL identifies the Web server. At the machine level it is an Internet Protocol (IP) address, but from the user's perspective it is a human-readable domain name. IPv4 addresses are 32-bit integers that are usually written in dotted-quad notation, in which the 32-bit address is divided into four 8-bit bytes [15].
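These components can be separated with Python's standard urllib.parse module; the following is a minimal sketch, and the example URL is hypothetical.

    # Minimal sketch: splitting a URL into the components described above,
    # using only Python's standard library. The example URL is hypothetical.
    from urllib.parse import urlparse

    url = "http://www.example.com/~jtma/url/example.html"
    parts = urlparse(url)

    print(parts.scheme)    # protocol, e.g. 'http'
    print(parts.hostname)  # hostname, e.g. 'www.example.com'
    print(parts.path)      # path, e.g. '/~jtma/url/example.html'

    # Path tokens delimited by slashes, dots, and dashes hint at site organization.
    tokens = [t for t in parts.path.replace(".", "/").replace("-", "/").split("/") if t]
    print(tokens)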

    IV. FEATURES OF URL

In this paper we propose a system for detecting malicious URLs. To implement the proposed system in practice, we collected a total of 700 website URLs, of which 500 are real website URLs, i.e., without any malicious data, and the remaining 200 are malicious; they are randomly placed in our dataset. To detect which URLs are malicious, we extract the following features (a feature-extraction sketch appears after this list): length, number of dots, TTL, and registrar information (get info).

1) Length: We compute the length of the website URL; on the basis of the length, we can judge whether the URL is real or malicious.

2) Number of dots: We can also judge whether a URL is malicious based on the number of dots present in the whole URL.

3) TTL: Time-to-live (TTL) is a value in an Internet Protocol (IP) packet that tells a network router whether the packet has been in the network too long and should be discarded. For a number of reasons, packets may not be delivered to their destination in a reasonable length of time.

4) Get Info: This parameter contains information about the registrar: detailed information for every website is present, i.e., in whose name the website is registered, along with the registrant's complete details. It is another feature on the basis of which we can detect whether a URL is malicious.

5) Date: The date on which the website URL was launched is also a feature that can help in detecting whether the website is malicious.

6) Whois Connect: This gives information about the server, i.e., when it was registered, the date of registration, the name of the registrar, and other registration details of the website.
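The sketch below computes the purely lexical features from this list using only the Python standard library; the TTL and registrar lookups are indicated in comments because they require network access and third-party packages (e.g., dnspython and python-whois), which we mention only as assumptions.

    # Minimal feature-extraction sketch for the features listed above.
    # Only lexical features are computed; TTL and WHOIS lookups are noted
    # in comments because they need network access and extra packages.
    from urllib.parse import urlparse

    def extract_features(url: str) -> dict:
        parts = urlparse(url)
        features = {
            "length": len(url),          # feature 1: URL length
            "num_dots": url.count("."),  # feature 2: number of dots
        }
        # Feature 3 (TTL) could be read from a DNS answer, e.g. with
        # dnspython: dns.resolver.resolve(parts.hostname, "A").rrset.ttl
        # Features 4-6 (registrar details, registration date, whois) could
        # be obtained with a WHOIS client such as python-whois.
        return features

    print(extract_features("http://www.example.com/~jtma/url/example.html"))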

    V. SYSTEM ARCHITECTURE

Figure 2: Framework for malicious URL detection

The primary target of this paper is to develop a framework that handles the fact that obtaining the true class of every example is impractical, and that takes the cost of misclassification into account when updating the classifier after a loss is suffered. We propose Online Dynamic Learning with Cost Sensitivity (ODLCS) to meet this target, as stated previously. The objective of supervised malicious URL detection is to build a predictive model that can accurately predict whether an incoming URL instance is malicious or not [10][11][12][13][14]. In general, this can be cast as a binary classification task in which malicious URL instances form the positive class ("+1") and normal URL instances form the negative class ("-1"). For an online malicious URL detection task, the objective is to build an online learner that incrementally constructs a classification model from a stream of URL training instances in an online learning fashion. Specifically, in each learning round, the learner first receives a new incoming URL instance; it then applies the classification model to predict whether the instance is malicious or not; and at the end of the round, if the true class label of the instance can be revealed from the environment, the learner uses the labeled instance to update the classification model whenever the prediction is incorrect. In general, it is natural to apply online learning to online malicious URL detection; however, it is infeasible to directly apply an existing online learning framework to solve these issues [4][5][6][8].

This is because a routine online classification task typically assumes that the class label of every incoming instance will be revealed, so that it can be used to update the classification model at the end of each learning round. Clearly, it is impossible, or prohibitively expensive, to query the class label of every incoming instance in an online malicious URL detection task. To address this challenge, the proposed framework investigates a novel scheme of ODLCS, as shown in Figure 2. Broadly speaking, the proposed ODLCS scheme tries to address two key difficulties in a systematic and synergic learning methodology:

1. The learner must decide when it should query the class label of an incoming URL instance.

2. The learner must decide how best to update the classifier when a newly labeled URL instance arrives (a sketch of both decisions follows).
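A minimal sketch of these two decisions is shown below: a margin-based rule decides when to query the true label, and a cost-weighted, mistake-driven rule updates the classifier. This is our own illustrative construction (a cost-sensitive perceptron with uncertainty-based querying), not the exact ODLCS/CSOAL algorithm.

    # Minimal sketch of the two decisions above. The costs, margin, and
    # feature vector are illustrative assumptions.
    import numpy as np

    class CostSensitiveOnlineActiveLearner:
        def __init__(self, n_features, cost_pos=10.0, cost_neg=1.0, query_margin=1.0):
            self.w = np.zeros(n_features)
            # Missing a malicious URL (+1) is penalized more heavily than
            # misflagging a normal one (-1).
            self.cost = {+1: cost_pos, -1: cost_neg}
            self.query_margin = query_margin

        def predict(self, x):
            return +1 if self.w @ x >= 0 else -1

        def should_query(self, x):
            # Decision 1: query the label only when the instance falls
            # close to the decision boundary (uncertain prediction).
            return abs(self.w @ x) <= self.query_margin

        def update(self, x, y):
            # Decision 2: perceptron-style update on mistakes, scaled by
            # the class-dependent misclassification cost.
            if self.predict(x) != y:
                self.w += self.cost[y] * y * x

    # One learning round for an incoming URL feature vector x:
    learner = CostSensitiveOnlineActiveLearner(n_features=2)
    x = np.array([72.0, 5.0])      # hypothetical feature vector
    y_hat = learner.predict(x)     # predict malicious (+1) or normal (-1)
    if learner.should_query(x):    # query the true label only if uncertain
        y_true = +1                # label revealed by the environment
        learner.update(x, y_true)

VI. DATASET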

For the experiments we take the dataset from http://sysnet.ucsd.edu/projects/url/. The original dataset was created so as to be roughly class-balanced. In the proposed system we produce a split by sampling from the original dataset to bring it closer to a realistic distribution, in which the number of normal URLs is significantly larger than the number of malicious URLs (a sampling sketch follows).
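A minimal sketch of this subsampling step is shown below, assuming the URLs are already separated by label; the 1% malicious fraction is illustrative, not the paper's exact split.

    # Minimal sketch: downsample the malicious class so benign URLs heavily
    # outnumber malicious ones. The target fraction is illustrative.
    import random

    random.seed(0)

    def make_imbalanced(benign, malicious, malicious_fraction=0.01):
        # Keep all benign URLs; sample malicious URLs to the target fraction.
        n_mal = max(1, round(len(benign) * malicious_fraction / (1 - malicious_fraction)))
        return benign + random.sample(malicious, min(n_mal, len(malicious)))

    urls = make_imbalanced(benign=["b%d" % i for i in range(500)],
                           malicious=["m%d" % i for i in range(200)])
    print(len(urls))  # 500 benign + 5 sampled malicious = 505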

    VII. CONCLUSION

In this paper we proposed a novel system of Online Dynamic Learning with Cost Sensitivity (ODLCS) for handling real-world classification applications such as the online malicious URL detection task. We extract features from each URL and use them to classify URLs as positive or negative; after training the classifier, each newly arriving URL is tested and classified as malicious or normal.

    VIII. REFERENCES

[1] J. Wang, P. Zhao, and S. C. H. Hoi, "Cost-sensitive online classification," IEEE Trans. Knowl. Data Eng., vol. 26, no. 10, Oct. 2014.

[2] P. Zhao and S. C. H. Hoi, "Cost-sensitive online active learning with application to malicious URL detection," in Proc. 19th ACM SIGKDD Int. Conf. KDD, Aug. 11-14, 2013.

[3] R. Akbani, S. Kwek, and N. Japkowicz, "Applying support vector machines to imbalanced datasets," in Proc. 15th ECML, Pisa, Italy, 2004, pp. 39-50.


[4] R. K. Bock et al., "Methods for multidimensional event classification: A case study using images from a Cherenkov gamma-ray telescope," Nucl. Instrum. Meth. A, vol. 516, no. 2-3, pp. 511-528, 2004.

[5] G. Blanchard, G. Lee, and C. Scott, "Semi-supervised novelty detection," J. Mach. Learn. Res., vol. 11, pp. 2973-3009, Nov. 2010.

[6] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Comput. Surv., vol. 41, no. 3, Article 15, 2009.

[7] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, no. 1, pp. 321-357, 2002.

[8] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, "Online passive-aggressive algorithms," J. Mach. Learn. Res., vol. 7, pp. 551-585, Mar. 2006.

[9] K. Crammer, M. Dredze, and F. Pereira, "Exact convex confidence-weighted learning," in Proc. NIPS, 2008, pp. 345-352.

[10] P. Domingos, "MetaCost: A general method for making classifiers cost-sensitive," in Proc. 5th ACM SIGKDD Int. Conf. KDD, San Diego, CA, USA, 1999, pp. 155-164.

[11] M. Dredze, K. Crammer, and F. Pereira, "Confidence-weighted linear classification," in Proc. 25th ICML, Helsinki, Finland, 2008, pp. 264-271.

[12] C. Elkan, "The foundations of cost-sensitive learning," in Proc. 17th IJCAI, San Francisco, CA, USA, 2001, pp. 973-978.

[13] Y. Freund and R. E. Schapire, "Large margin classification using the perceptron algorithm," Mach. Learn., vol. 37, no. 3, pp. 277-296, 1999.

[14] C. Gentile, "A new approximate maximal margin classification algorithm," J. Mach. Learn. Res., vol. 2, pp. 213-242, Dec. 2001.

[15] S. C. H. Hoi, R. Jin, P. Zhao, and T. Yang, "Online multiple kernel classification," Mach. Learn., vol. 90, no. 2, pp. 289-316, 2013.

    AUTHORS PROFILE

Mr. Bharat S. Jadhav received the B.E. degree in Information Technology from Pravara Rural Engineering College in 2012. During 2013-2014, he was a lecturer in the Computer Technology Department at Late Hon. D. R. Kakade Polytechnic, Pimpalwandi. He is currently working at Tikona Digital Networks as a Network Support Engineer, and is pursuing a Master of Engineering at Sharadchandra Pawar College of Engineering, Dumbarwadi, Otur, University of Pune.

Dr. S. V. Gumaste is currently working as Professor and Head, Department of Computer Engineering, SPCOE-Dumbarwadi, Otur. He graduated from BLDE Association's College of Engineering, Bijapur, Karnataka University, Dharwad, in 1992, completed his post-graduation in CSE from SGBAU, Amravati, in 2007, and completed his Ph.D. (CSE) in the Faculty of Engineering & Technology at SGBAU, Amravati. He has around 22 years of teaching experience.