IJTC201510012-Email With Classification Detection Power

Embed Size (px)

Citation preview

  • 8/17/2019 IJTC201510012-Email With Classification Detection Power

    1/7

      INTERNATIONAL JOURNAL OF TECHNOLOGY AND COMPUTING (IJTC) 

    ISSN-2455-099X,

    Volume 1, Issue 1, OCTOBER 2015.

    1

    EMAIL WITH CLASSIFICATION DETECTION POWER

    Pallavi a, ∗

    , Manpreet Virk b 

    a,b Department of CSE , GGS Modern Technology, Kharar

    ABSTRACT

    This research is to classify and filter the large amount of data. The main purpose of this research is to reduce the

    error rate of the data and to improve the accuracy. In the previous techniques of classification there may be

    some miss classification. But in this research the problem of misclassification is reduced. The work is presented

     by this research is some modifications in the classification technique. Therefore, it’s a good enterprise solution

    for filtering. This will optimize the system performance and make some improvements on the previous

    algorithm. This will give the better results from the previous one. 

     Keywords: error rate, Techniques, class labels spam and non-spam.

    I.  INTRODUCTION 

    Email filtering is the processing of email to systematize it according to the exact criteria.

    Most often this refers to the automatic processing of incoming messages, but the term is also

    used to the involvement of human intelligence in addition to anti-spam techniques. Bayesian

    spam filtering is a statistical method of e-mail filtering. Bayesian spam filtering makes use

    for Naive Bayes classifier to make out spam e-mail. Work is classified by Bayesian to

    compare the use of tokens i.e typically words, or we can say irregularly other things, with

    spam and non-spam e-mails. Bayesian spam filtering is a extremely powerful technique for

    constricting with spam, that can adapt itself to the email needs of individual users, and gives

    low false positive spam finding rates that are generally acceptable to users.

    A. Email Filtering Benefits 

    Deal with the Service:  Because our services are a managed explanation, there are no

    additional costs for software/hardware upgrades, Internet bandwidth, or labor for

    maintenance. Taking away of these expenses saves costs and eases budgeting by eliminating

    capital lay out and surprising expenses that are sometimes incurred through enforced

    upgrades of hardware- or software-based solutions. Specialized expertise is also offered on

    mail routing, filtering, and blacklists, allowing the staff to concentrate on the infrastructure.

  • 8/17/2019 IJTC201510012-Email With Classification Detection Power

    2/7

      INTERNATIONAL JOURNAL OF TECHNOLOGY AND COMPUTING (IJTC) 

    ISSN-2455-099X,

    Volume 1, Issue 1, OCTOBER 2015.

    2

    Improves Efficiency: These services recover staff's productivity by eliminating typically 98%

    of unwanted email. Current industry estimates designate that as much as 70% of all email is

    unwanted, wasting employees’ time manually filtering and deleting messages. Because we

    offer a large array of configuration options, filtering solution can be tailored to needs rather

    than changing the way to do business.

    Reduces Communications Load:  By filtering outside of the premises, one eliminates the

    requirement for infrastructure to deal with the email messages that are filtered out, save

    Internet bandwidth and server load. This can be facilitated to extend the useful life of assets

     by deferring ability issues.

    Mitigates Liability: Elimination of nasty content from mail stream completely reduces the

    chances of "hostile place of work " lawsuits from employees. Although filtering can never be

    100% correct the positive defensive actions of utilizing a managed service demonstrates a

    good-faith attempt to protect workers to the highest degree technology permits.

    Increases Safety measures:  Because filtered email never enters infrastructure, reduce the

    exposure to virus and other "malware". Since one act as mail agent, ones own equipment no

    longer needs to be registered or usually visible on the Internet, eliminating the risk of hacking

    and other malicious actions.

    Avoids Investment: Unlike purchased hardware or software solutions, managed service has

    no investment. The only overheads incurred are month-to-month fees, with no reduction, no

    capital outlay, and no upholding contracts.

    Entirely Compatible: Services are based on standard Internet protocols and will interoperate

    with any new Internet mail infrastructure. The risk is avoided of compatibility issues when

    replacing or upgrading other components of infrastructure, unlike solutions based on

    software that are only supported with specific mail agents or operating systems.

    Improves Reliability:  In this the incoming mail is stored during periods of disruption of

    local Internet server unavailability and re-delivers when service is again available. This

    avoids pointless instances of mail returned to the sender due to local problems. By delivering

    outgoing mail, problems will be avoided of server blacklisting due to dynamic addresses on

    cable and DSL networks and violence from other subscribers of Internet service provider.

    Simply examine the latest anti-spam filtering techniques and hit upon ways how to cut them,

    usually done by simply change the message a little. This gave anti spam developers a new

    challenge come up with a new anti spam technique; one that was familiar with spammers’

    http://www.allspammedup.com/anti-spam/http://www.allspammedup.com/anti-spam/

  • 8/17/2019 IJTC201510012-Email With Classification Detection Power

    3/7

      INTERNATIONAL JOURNAL OF TECHNOLOGY AND COMPUTING (IJTC) 

    ISSN-2455-099X,

    Volume 1, Issue 1, OCTOBER 2015.

    3

    tactics as they vary over time, and that is capable to adapt to the particular organization that it

    is protecting from spam. There are different emails filtering methods.

    1) Blacklist:  Blacklist comes under the list based filters. This is spam filtering method

    attempts to stop unwanted email by blocking messages from the list of sender. Blacklist

    contains the records of email addresses. In this when in coming message arrives, the spamfilter checks to see if its IP or email address is on the blacklist. Then it considers the message

    as a spam and then reject it.

    2) Whitelist:  Whitelist blocks spam using a system almost exactly opposite to that of

     blacklist. In this if an unknown sender’s email address is checked against the database, if they

    have no history of spamming, their message is sent to inbox and then they added to the

    whitelist.

    3) Word based filtering: Word based filtering comes under the content based filtering it is

    the simplest form of filtering .word based filtering is the capable technique for fighting junk

    email. For example, if the filter has been set to stop all messages containing the word “acbd”.

    But spammers often purposefully misspell keywords in order to evade word based filtering

    and this is the main problem in this type of filtering.

    4) Bayesian filters: Bayesian filters technique is the most advance content based technique.

    It employs the laws of mathematical probability to settle on which message are real and

    which message is spam. In this, filter takes words and phrases finding legitimate mails ad

    adds them to the list. This method acquires a training time period before it starts running well.

    There are other filtering methods like challenge/response system, collaborative filters.

    Bayesian spam filtering is the process of using a naive Bayes classifier to identify spam e-

    mail. It is depended on the principle that most events are dependent and that the probability

    of an event occurring in the future can be inferred from the previous occurrences of that

    event. This similar method can be used to classify spam. If some content of text helds often in

    spam but not in legitimate mail, then it would be reasonable to predict that this email is

    almost certainly spam.

    II. LITERATURE REVIEW

    Xiaoming JIN, Yuchang LU et al (2003)  Index structure that enables efficient similarity

    queries in high-dimensional space is crucial for many applications. This paper discusses the

    indexing problem in dataset composed of partially clustered data, which exists in number of

    applications. Existing index methods are inefficient with incompletely clustered datasets. The

    dynamic and adaptive index formation presented here, called a multi-cluster tree (MC-tree),

    http://www.allspammedup.com/2009/01/bayesian-spam-filtering-with-exchange-server-2007/http://www.statsoft.com/textbook/stnaiveb.htmlhttp://www.statsoft.com/textbook/stnaiveb.htmlhttp://www.allspammedup.com/2009/01/bayesian-spam-filtering-with-exchange-server-2007/

  • 8/17/2019 IJTC201510012-Email With Classification Detection Power

    4/7

      INTERNATIONAL JOURNAL OF TECHNOLOGY AND COMPUTING (IJTC) 

    ISSN-2455-099X,

    Volume 1, Issue 1, OCTOBER 2015.

    4

    consists of a set of height-balanced trees for indexing. This index structure improves the

    querying efficiency in three ways:

    1) Most bounding regions achieve uniform distributions, which results in fewer splits and less

    overlap compared with a single indexing tree.

    2) The clusters in the dataset are with dynamism detected when the index is updated.

    3) The query process does not involve a sequential scan. The MC-tree was shown to be better

    than hierarchical and cluster-based indexes for the partially clustered datasets.

    This paper presents an index structure for partially clustered datasets which constitute a large

     portion of data stored in current information systems. The goal was to make the index

    respond efficiently to both clustered and uniform data in one database and to perform queries

    on it without losing precision and recall. This index structure improves the query efficiency

    in the following ways:

    1) Index only the non-clustered data in the main index. It ensures that the main index has

    fewer overlaps and splitting compared with a single indexing tree.

    2) The clusters in the dataset are dynamically detected when the index is updated, which

    ensures the index adaptive and keeps the index from the decrease of performance.

    3) During the query process, each data point is retrieved from a hierarchical index, so

    sequential scans are not required. Uniform data and partially clustered data were used to

    evaluate the performance of MC-tree. The results verified that MC-tree outperformed the

    common hierarchical indexes and cluster-based indexes for the partially clustered dataset.

    Hovold Johan (2004)  in this research, the use of the naive bayes classifier as the basis for

     personalised spam filters is explored. According to this paper, the several machine learning

    algorithms are explored already, they were included variants of naive bayes, but in this

     proposal the author used word position based attribute vectors, through which very good

    results are given when they tested on several publically available corpora.

    III. RESULTS

  • 8/17/2019 IJTC201510012-Email With Classification Detection Power

    5/7

      INTERNATIONAL JOURNAL OF TECHNOLOGY AND COMPUTING (IJTC) 

    ISSN-2455-099X,

    Volume 1, Issue 1, OCTOBER 2015.

    5

    Fig. 1: GUI of Work

    Fig. 2: Scattering of the dataset on the basis of the class labels spam and non-spam

  • 8/17/2019 IJTC201510012-Email With Classification Detection Power

    6/7

      INTERNATIONAL JOURNAL OF TECHNOLOGY AND COMPUTING (IJTC) 

    ISSN-2455-099X,

    Volume 1, Issue 1, OCTOBER 2015.

    6

    Fig.3: Classification plotted using Naive Bayes Kernel

    Fig.4: Plotting the best choice

    IV. CONCLUSION

    Email is method of exchanging digital messages from source to destination. The exchange of

    messages from an author to one or more. Email messages can be text files, graphics images

    and sound files. Email messages are usually encoded in the ASCII text. Spam or unsolicited

    e-mail has become a major problem for companies and private users. This paper explored the

  • 8/17/2019 IJTC201510012-Email With Classification Detection Power

    7/7

      INTERNATIONAL JOURNAL OF TECHNOLOGY AND COMPUTING (IJTC) 

    ISSN-2455-099X,

    Volume 1, Issue 1, OCTOBER 2015.

    7

    various problems associated with spam and different methods and techniques attempting to

    deal with it. From the study we identified that, many of the filtering techniques are based on

    text categorization methods and there is no technique can claim to provide an ideal solution

    with 0% false positive.

    REFERENCES

    [1] Han, J., Kamber, M., & Pei, J. (2006). Data mining: concepts and techniques. Morgan kaufmann.

    [2] Jiawei, H., &Kamber, M. (2001). Data mining: concepts and techniques. San Francisco, CA, itd: Morgan

    Kaufmann, Data, C. H. D. (2010). Data Mining: Concepts and Techniques.

    [3] Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., &Spyropoulos, C. D. (2000, July). An experimental

    comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In

    Proceedings of the 23rd annual international ACM SIGIR conference on Research and development ininformation retrieval (pp. 160-167). ACM.

    [4] Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C. D., &Stamatopoulos, P.

    (2000). Learning to filter spam e-mail: A comparison of a naive bayesian and a memory-based approach. arXiv

     preprint cs/0009009.

    [5] Basavaraju, M., &Prabhakar, R. (2010). A novel method of spam mail detection using text based clustering

    approach. International Journal of Computer Applications, 5(4).

    [6] Hovold, J. (2005, July). Naive bayes spam filtering using word-position-based attributes. In Proceedings of

    the 2nd Conference on Email and Anti-Spam (CEAS 2005).

    [7] Jin, X., Wang, L., Lu, Y., & Shi, C. (2003). MC-tree: Dynamic index structure for partially clustered multi-

    dimensional database. Tsinghua Science and Technology, 8(2), 174-180.

    [8] Liu, P. Y., Zhang, L. W., & Zhu, Z. F. (2009). Research on e-mail filtering based on improved Bayesian.

    Journal of Computers, 4(3), 271-275.

    [9] Rajput, A., &Toshniwal, D. Adaptive Spam Filtering based on Bayesian Algorithm.

    [10] Rennie, J. (2000, August). ifile: An application of machine learning to e-mail filtering.In Proc. KDD 2000

    Workshop on Text Mining, Boston, MA.

    [11] Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998, July). A Bayesian approach to filtering junk

    e-mail. In Learning for Text Categorization: Papers from the 1998 workshop (Vol. 62, pp. 98-105).

    [12] Song, Y., Kołcz, A., & Giles, C. L. (2009). Better Naive Bayes classification for high ‐ precision spam

    detection. Software: Practice and Experience, 39(11), 1003-1024.

    [13] TIAN Jinlan, ZHANG Suqin, ZHU Lin, LIU Lu. (2005). Improvement and Parallelism of k-Means

    Clustering Algorithm. Department of Computer Science and Technology, Tsinghua University, Beijing 100084,

    China, 10(3), 277-281.