Application of Malicious URL Detection In Data Mining


International Journal of Advance Foundation and Research in Computer (IJAFRC)
Volume 2, Issue 7, July - 2015. ISSN 2348 4853
© 2015, IJAFRC. All Rights Reserved. www.ijafrc.org

Application of Malicious URL Detection in Data Mining

Mr. Jadhav Bharat S.*1, Dr. Gumaste S. V.*2

*1 M.E. Student, Department of Computer Engineering, Sharadchandra Pawar College of Engineering, Dumbarwadi, Otur, Pune, Maharashtra, India

*2 Associate Professor and Head, Department of Computer Engineering, Sharadchandra Pawar College of Engineering, Dumbarwadi, Otur, Pune, Maharashtra, India

[email protected]*1, [email protected]*2

ABSTRACT

Many recent computer attacks are launched by luring users to visit a malicious webpage. In this paper we construct a real-time system that uses machine learning techniques to detect malicious URLs (spam, phishing, exploits, and so on). To this end, we examine techniques that classify URLs based on their lexical and host-based features, together with online learning to process large numbers of examples and adapt quickly to evolving URLs over time. However, in a real-world malicious URL detection task, the ratio between the number of malicious URLs and legitimate URLs is highly imbalanced, making it inappropriate to simply optimize prediction accuracy. A further limitation of previous work is the assumption that a large amount of labeled training data is available, which is impractical because human labeling is expensive. A user can be tricked into voluntarily giving away confidential information on a phishing page, or can become the victim of a drive-by download that results in a malware infection. A malicious URL is a link pointing to malware or a phishing site, and it may then propagate through the victim's contact list. Moreover, hackers sometimes use social engineering tricks that make malicious URLs hard to identify. To address these issues, in this paper we present a novel framework of Cost-Sensitive Online Active Learning (CSOAL).

Index Terms: Malicious URL Detection, Cost-Sensitive Learning, Online Learning, Active Learning.

    I. INTRODUCTION

The WWW allows people to access information across the Internet, but it also carries fraudulent content such as fake drugs, malware, and so on. The Web has become a vehicle for criminal enterprises such as spam-advertised commerce (e.g., counterfeit watches or pharmaceuticals), financial fraud (e.g., via phishing), and the propagation of malware (e.g., so-called drive-by downloads) [1][2]. A user accesses all kinds of information (trusted or suspicious) on the Web by clicking on a URL (Uniform Resource Locator) that links to a particular website. It is thus important for Internet users to assess the risk of clicking a URL in order to avoid accessing malicious websites.

Although the exact adversary mechanisms behind web criminal activities may vary, they all try to lure users into visiting malicious websites by clicking a corresponding URL (Uniform Resource Locator) [3]. The motivations behind these schemes may differ; the common element among them is the requirement that unsuspecting users visit their sites.


Visits to these sites can be driven by email, Web search results, or links from other Web pages, but all require the user to take some action, such as clicking, that specifies the desired Uniform Resource Locator (URL). A URL is called malicious (also known as black) if it is created for a malicious purpose and leads a user to a specific threat that may become an attack, such as spyware, malware, or phishing. Malicious URLs are a major risk on the Web, and detecting them is therefore an essential task in network security intelligence. If users could be warned that a particular URL was dangerous before visiting it, the problem could be avoided.

The security community has responded by developing blacklisting services, which tell users whether a particular site is malicious. These blacklists are constructed by extracting features from URLs; based on those features, a classifier can divide URLs into a whitelist and a blacklist. However, many suspicious sites are not blacklisted, either because they were launched recently, were never visited by a user, or were checked incorrectly (e.g., due to cloaking) [4][5][6][7]. To address this problem, some client-side systems analyze the content or behavior of a website as it is visited. But in addition to run-time overhead, these approaches can expose the user to the very browser-based vulnerabilities that we seek to avoid.

In this paper, we focus on a complementary part of the design space: URL classification, that is, classifying the reputation of a website entirely from its URL. The motivation is to provide inherently better coverage than blacklisting-based approaches (e.g., correctly predicting the status of new sites) while avoiding the client-side overhead and risk of approaches that analyze Web content on demand [8][9][10][11]. In particular, we explore the use of statistical methods from machine learning for classifying site reputation based on the relationship between URLs and the lexical and host-based features that characterize them.

    II. PROBLEM STATEMENT

Our main purpose is to treat URL detection as a binary classification problem in which positive examples are malicious URLs and negative examples are benign URLs. This approach can succeed if the distribution of extracted feature values for malicious examples differs from that of benign examples, and if the training set shares the same feature distribution as the testing set.
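As a concrete illustration of this formulation, the sketch below trains an off-the-shelf binary classifier on toy URL feature vectors; the feature values and the choice of scikit-learn's LogisticRegression are our own illustrative assumptions, not the classifier used in the paper.

    # Minimal sketch of the binary classification formulation.
    # The toy feature vectors (URL length, dot count) are hypothetical.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Label +1 = malicious, -1 = benign.
    X = np.array([[72, 5], [18, 1], [95, 7], [22, 2], [60, 4], [15, 1]])
    y = np.array([+1, -1, +1, -1, +1, -1])

    # The formulation assumes train and test share the same feature distribution.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    clf = LogisticRegression().fit(X_train, y_train)
    print(clf.predict(X_test))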

Also, we classify URLs based only on the relationship between URLs and the lexical and host-based features that characterize them; we do not consider two other potentially useful sources of features: the content of the page the URL points to, and the context of the URL [12][13][14][15] (e.g., the page or email in which the URL is embedded).

Although this information could improve classification accuracy, we exclude it for the following reasons.

1. Avoiding downloading page content is safer for users.

2. Classifying a URL with a trained model is a lightweight operation compared to first downloading the page and then using its contents for classification.

3. Concentrating on URL features makes the classifier applicable to any context in which URLs are found (Web pages, email, chat, calendars, games, etc.), rather than dependent on a particular application setting.

4. Reliably obtaining the malicious version of a page for both training and testing can be a difficult practical issue.


Suspicious sites have shown the ability to cloak the content of their Web pages, that is, to show different content to different clients. For example, a suspicious server may send benign versions of a page to honeypot IP addresses that belong to security practitioners, but send malicious versions to other clients [10][15].

    III. URL RESOLUTION

URLs are human-readable text strings. Through a multistep resolution process, browsers translate each URL into instructions that locate the server hosting the site and specify where the site or resource is placed on that host [6][8][10][14]. The standard syntax of a URL is

[protocol]://[hostname][path]

    Fig. 1. Example of a Uniform Resource Locator (URL) and its components.

Components of a URL (a parsing sketch follows this list):

1) Protocol: This portion of the URL indicates which network protocol should be used to fetch the requested resource. The most commonly used protocol is the Hypertext Transfer Protocol, or HTTP (http). In Figure 1, all of the example URLs specify the HTTP protocol [10].

2) Path: This portion of a URL is analogous to the path name of a file on a local computer. In Figure 1, the example path is /~jtma/url/example.html. The path tokens, delimited by punctuation such as slashes, dots, and dashes, show how the site is organized [11].

3) Hostname: This portion of the URL identifies the Web server. At the machine level it is an Internet Protocol (IP) address, but from the user's perspective it is a human-readable domain name. IPv4 addresses are 32-bit integers that are usually written in dotted-quad notation, in which the 32-bit address is divided into four 8-bit bytes [15].
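These components can be separated with Python's standard urllib.parse module; the following is a minimal sketch, and the example URL is hypothetical.

    # Minimal sketch: splitting a URL into the components described above,
    # using only Python's standard library. The example URL is hypothetical.
    from urllib.parse import urlparse

    url = "http://www.example.com/~jtma/url/example.html"
    parts = urlparse(url)

    print(parts.scheme)    # protocol, e.g. 'http'
    print(parts.hostname)  # hostname, e.g. 'www.example.com'
    print(parts.path)      # path, e.g. '/~jtma/url/example.html'

    # Path tokens delimited by slashes, dots, and dashes hint at site organization.
    tokens = [t for t in parts.path.replace(".", "/").replace("-", "/").split("/") if t]
    print(tokens)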

    IV. FEATURES OF URL

In this paper we propose a system for detecting malicious URLs. To implement the proposed system in practice, we collected a total of 700 website URLs, of which 500 are real website URLs, i.e., without any malicious data, and the remaining 200 are malicious; they are randomly placed in our dataset. To detect which URLs are malicious, we extract the following features (a feature-extraction sketch appears after this list): length, number of dots, TTL, and registrar information (get info).

1) Length: We compute the length of the website URL; on the basis of the length, we can judge whether the URL is real or malicious.

2) Number of dots: We can also judge whether a URL is malicious based on the number of dots present in the whole URL.

3) TTL: Time-to-live (TTL) is a value in an Internet Protocol (IP) packet that tells a network router whether the packet has been in the network too long and should be discarded. For a number of reasons, packets may not be delivered to their destination in a reasonable length of time.

4) Get Info: This parameter contains information about the registrar: detailed information for every website is present, i.e., in whose name the website is registered, along with the registrant's complete details. It is another feature on the basis of which we can detect whether a URL is malicious.

5) Date: The date on which the website URL was launched is also a feature that can help in detecting whether the website is malicious.

6) Whois Connect: This gives information about the server, i.e., when it was registered, the date of registration, the name of the registrar, and other registration details of the website.
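The sketch below computes the purely lexical features from this list using only the Python standard library; the TTL and registrar lookups are indicated in comments because they require network access and third-party packages (e.g., dnspython and python-whois), which we mention only as assumptions.

    # Minimal feature-extraction sketch for the features listed above.
    # Only lexical features are computed; TTL and WHOIS lookups are noted
    # in comments because they need network access and extra packages.
    from urllib.parse import urlparse

    def extract_features(url: str) -> dict:
        parts = urlparse(url)
        features = {
            "length": len(url),          # feature 1: URL length
            "num_dots": url.count("."),  # feature 2: number of dots
        }
        # Feature 3 (TTL) could be read from a DNS answer, e.g. with
        # dnspython: dns.resolver.resolve(parts.hostname, "A").rrset.ttl
        # Features 4-6 (registrar details, registration date, whois) could
        # be obtained with a WHOIS client such as python-whois.
        return features

    print(extract_features("http://www.example.com/~jtma/url/example.html"))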

    V. SYSTEM ARCHITECTURE

Figure 2: Framework for malicious URL detection

The primary target of this paper is to develop a framework that handles the fact that obtaining the true class of every example is impractical, and that takes the cost of misclassification into account when updating the classifier after a loss is suffered. We propose Online Dynamic Learning with Cost Sensitivity (ODLCS) to meet this target, as stated previously. The objective of supervised malicious URL detection is to build a predictive model that can accurately predict whether an incoming URL instance is malicious or not [10][11][12][13][14]. In general, this can be cast as a binary classification task in which malicious URL instances form the positive class ("+1") and normal URL instances form the negative class ("-1"). For an online malicious URL detection task, the objective is to build an online learner that incrementally constructs a classification model from a stream of URL training instances in an online learning fashion. Specifically, in each learning round, the learner first receives a new incoming URL instance; it then applies the classification model to predict whether the instance is malicious or not; and at the end of the round, if the true class label of the instance can be revealed from the environment, the learner uses the labeled instance to update the classification model whenever the prediction is incorrect. In general, it is natural to apply online learning to online malicious URL detection; however, it is infeasible to directly apply an existing online learning framework to solve these issues [4][5][6][8].

This is because a routine online classification task typically assumes that the class label of every incoming instance will be revealed, so that it can be used to update the classification model at the end of each learning round. Clearly, it is impossible, or prohibitively expensive, to query the class label of every incoming instance in an online malicious URL detection task. To address this challenge, the proposed framework investigates a novel scheme of ODLCS, as shown in Figure 2. Broadly speaking, the proposed ODLCS scheme tries to address two key difficulties in a systematic and synergic learning methodology:

1. The learner must decide when it should query the class label of an incoming URL instance.

2. The learner must decide how best to update the classifier when a newly labeled URL instance arrives (a sketch of both decisions follows).
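A minimal sketch of these two decisions is shown below: a margin-based rule decides when to query the true label, and a cost-weighted, mistake-driven rule updates the classifier. This is our own illustrative construction (a cost-sensitive perceptron with uncertainty-based querying), not the exact ODLCS/CSOAL algorithm.

    # Minimal sketch of the two decisions above. The costs, margin, and
    # feature vector are illustrative assumptions.
    import numpy as np

    class CostSensitiveOnlineActiveLearner:
        def __init__(self, n_features, cost_pos=10.0, cost_neg=1.0, query_margin=1.0):
            self.w = np.zeros(n_features)
            # Missing a malicious URL (+1) is penalized more heavily than
            # misflagging a normal one (-1).
            self.cost = {+1: cost_pos, -1: cost_neg}
            self.query_margin = query_margin

        def predict(self, x):
            return +1 if self.w @ x >= 0 else -1

        def should_query(self, x):
            # Decision 1: query the label only when the instance falls
            # close to the decision boundary (uncertain prediction).
            return abs(self.w @ x) <= self.query_margin

        def update(self, x, y):
            # Decision 2: perceptron-style update on mistakes, scaled by
            # the class-dependent misclassification cost.
            if self.predict(x) != y:
                self.w += self.cost[y] * y * x

    # One learning round for an incoming URL feature vector x:
    learner = CostSensitiveOnlineActiveLearner(n_features=2)
    x = np.array([72.0, 5.0])      # hypothetical feature vector
    y_hat = learner.predict(x)     # predict malicious (+1) or normal (-1)
    if learner.should_query(x):    # query the true label only if uncertain
        y_true = +1                # label revealed by the environment
        learner.update(x, y_true)

VI. DATASET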

For the experiments we take the dataset from http://sysnet.ucsd.edu/projects/url/. The original dataset was created so as to be roughly class-balanced. In the proposed system we produce a split by sampling from the original dataset to bring it closer to a realistic distribution, in which the number of normal URLs is significantly larger than the number of malicious URLs (a sampling sketch follows).
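A minimal sketch of this subsampling step is shown below, assuming the URLs are already separated by label; the 1% malicious fraction is illustrative, not the paper's exact split.

    # Minimal sketch: downsample the malicious class so benign URLs heavily
    # outnumber malicious ones. The target fraction is illustrative.
    import random

    random.seed(0)

    def make_imbalanced(benign, malicious, malicious_fraction=0.01):
        # Keep all benign URLs; sample malicious URLs to the target fraction.
        n_mal = max(1, round(len(benign) * malicious_fraction / (1 - malicious_fraction)))
        return benign + random.sample(malicious, min(n_mal, len(malicious)))

    urls = make_imbalanced(benign=["b%d" % i for i in range(500)],
                           malicious=["m%d" % i for i in range(200)])
    print(len(urls))  # 500 benign + 5 sampled malicious = 505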

    VII. CONCLUSION

In this paper we proposed a novel system of Online Dynamic Learning with Cost Sensitivity (ODLCS) for handling real-world classification applications such as the online malicious URL detection task. We extract features from each URL and use them to classify URLs as positive or negative; after training the classifier, each newly arriving URL is tested and classified as malicious or normal.

    VIII. REFERENCES

[1] J. Wang, P. Zhao, and S. C. H. Hoi, "Cost-sensitive online classification," IEEE Trans. Knowl. Data Eng., vol. 26, no. 10, Oct. 2014.

[2] P. Zhao and S. C. H. Hoi, "Cost-sensitive online active learning with application to malicious URL detection," in Proc. 19th ACM SIGKDD Int. Conf. KDD, Aug. 11-14, 2013.

[3] R. Akbani, S. Kwek, and N. Japkowicz, "Applying support vector machines to imbalanced datasets," in Proc. 15th ECML, Pisa, Italy, 2004, pp. 39-50.


[4] R. K. Bock et al., "Methods for multidimensional event classification: A case study using images from a Cherenkov gamma-ray telescope," Nucl. Instrum. Meth. A, vol. 516, no. 2-3, pp. 511-528, 2004.

[5] G. Blanchard, G. Lee, and C. Scott, "Semi-supervised novelty detection," J. Mach. Learn. Res., vol. 11, pp. 2973-3009, Nov. 2010.

[6] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Comput. Surv., vol. 41, no. 3, Article 15, 2009.

[7] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, no. 1, pp. 321-357, 2002.

[8] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, "Online passive-aggressive algorithms," J. Mach. Learn. Res., vol. 7, pp. 551-585, Mar. 2006.

[9] K. Crammer, M. Dredze, and F. Pereira, "Exact convex confidence-weighted learning," in Proc. NIPS, 2008, pp. 345-352.

[10] P. Domingos, "MetaCost: A general method for making classifiers cost-sensitive," in Proc. 5th ACM SIGKDD Int. Conf. KDD, San Diego, CA, USA, 1999, pp. 155-164.

[11] M. Dredze, K. Crammer, and F. Pereira, "Confidence-weighted linear classification," in Proc. 25th ICML, Helsinki, Finland, 2008, pp. 264-271.

[12] C. Elkan, "The foundations of cost-sensitive learning," in Proc. 17th IJCAI, San Francisco, CA, USA, 2001, pp. 973-978.

[13] Y. Freund and R. E. Schapire, "Large margin classification using the perceptron algorithm," Mach. Learn., vol. 37, no. 3, pp. 277-296, 1999.

[14] C. Gentile, "A new approximate maximal margin classification algorithm," J. Mach. Learn. Res., vol. 2, pp. 213-242, Dec. 2001.

[15] S. C. H. Hoi, R. Jin, P. Zhao, and T. Yang, "Online multiple kernel classification," Mach. Learn., vol. 90, no. 2, pp. 289-316, 2013.

    AUTHORS PROFILE

Mr. Bharat S. Jadhav received the B.E. degree in Information Technology from Pravara Rural Engineering College in 2012. During 2013-2014, he was a lecturer in the Computer Technology Department at Late Hon. D. R. Kakade Polytechnic, Pimpalwandi. He is currently working at Tikona Digital Networks as a Network Support Engineer, and is pursuing a Master of Engineering at Sharadchandra Pawar College of Engineering, Dumbarwadi, Otur, University of Pune.

Dr. S. V. Gumaste is currently working as Professor and Head, Department of Computer Engineering, SPCOE-Dumbarwadi, Otur. He graduated from BLDE Association's College of Engineering, Bijapur, Karnataka University, Dharwad, in 1992, completed his post-graduation in CSE from SGBAU, Amravati, in 2007, and completed his Ph.D. (CSE) in the Faculty of Engineering & Technology at SGBAU, Amravati. He has around 22 years of teaching experience.