An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages

Authors:

Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos, Constantine D. Spyropoulos

Source: SIGIR 2000

Outline: Introduction, Feature selection, The Naive Bayesian classifier, Result

Introduction

Spam is abundant. This work compares a Naive Bayesian classifier with a keyword-based anti-spam filter.

Sahami et al. trained a Naive Bayesian classifier on manually categorized legitimate and spam messages.

The Naive Bayesian classifier

Each message is represented by a vector x = (x1, x2, x3, ..., xn), where x1, ..., xn are the values of attributes X1, ..., Xn.

Each attribute shows whether or not a particular word (e.g. "adult") is present in the message.

Additional attributes can correspond to phrases (e.g. "be over 21").

Non-textual properties can also serve as attributes (e.g. whether or not the message contains attachments).
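The binary representation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `to_attribute_vector`, the tokenization rule, and the three-entry vocabulary are assumptions made for the example.

```python
import re

def to_attribute_vector(message, vocabulary):
    """Return a binary vector: 1 if the attribute's word or phrase occurs in the message."""
    # Assumed tokenization: lower-case, keep letters, digits and "$".
    text = " ".join(re.findall(r"[a-z0-9$]+", message.lower()))
    padded = f" {text} "
    # Padding with spaces makes the substring test match whole words or phrases only.
    return [1 if f" {term} " in padded else 0 for term in vocabulary]

# Hypothetical vocabulary: two words and one phrase attribute.
vocabulary = ["adult", "be over 21", "meeting"]
x = to_attribute_vector("You must be over 21 to view this ADULT site", vocabulary)
print(x)  # [1, 1, 0]
```

A non-textual attribute (e.g. "has attachments") would simply be one more 0/1 entry appended to the vector.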

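Mutual information

Use mutual information (MI) to select candidate attributes. For a binary attribute X and the category C:

$$MI(X;C) = \sum_{x \in \{0,1\}} \sum_{c \in \{spam,\, legitimate\}} P(X=x, C=c) \cdot \log \frac{P(X=x, C=c)}{P(X=x) \cdot P(C=c)}$$

Then select the attributes with the highest mutual information values.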
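A minimal sketch of MI-based attribute selection, assuming binary attribute vectors and probabilities estimated as plain relative frequencies (the estimation method is an assumption; the slides do not state it). The names `mutual_information` and `select_attributes` are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, cs):
    """Estimate MI(X;C) for one attribute from paired samples (attribute value, category)."""
    n = len(xs)
    joint = Counter(zip(xs, cs))
    count_x = Counter(xs)
    count_c = Counter(cs)
    mi = 0.0
    # Pairs that never occur contribute nothing (0 * log 0 is taken as 0).
    for (x, c), count in joint.items():
        p_xc = count / n
        mi += p_xc * math.log(p_xc / ((count_x[x] / n) * (count_c[c] / n)))
    return mi

def select_attributes(vectors, labels, k):
    """Return the indices of the k attributes with the highest MI, best first."""
    n_attrs = len(vectors[0])
    scores = [mutual_information([v[i] for v in vectors], labels) for i in range(n_attrs)]
    ranked = sorted(range(n_attrs), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy data: attribute 0 separates the classes perfectly, attribute 1 is independent of them.
vectors = [[1, 0], [1, 1], [0, 0], [0, 1]]
labels = ["spam", "spam", "legitimate", "legitimate"]
print(select_attributes(vectors, labels, 1))  # [0]
```

In the toy data the first attribute reaches the maximum MI for a balanced two-class problem (log 2), while the second scores zero, so only the first is retained.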

The Naive Bayesian classifier

L->S (legitimate classified as spam) and S->L (spam classified as legitimate) denote the two error types.

We assume that L->S is λ times more costly than S->L.

Classify a message as spam if the following classification criterion is met:

$$\frac{P(C = spam \mid X = x)}{P(C = legitimate \mid X = x)} > \lambda$$

This is equivalent to P(C = spam | X = x) > t, with t = λ / (1 + λ).

λ = 999 (t = 0.999): mistakenly blocking a legitimate message is taken to be as bad as letting 999 spam messages pass the filter.

λ = 9 (t = 0.9): when a message is blocked, an apology and a riddle are returned to the sender.

λ = 1 (t = 0.5): the recipient does not care about the extra work imposed on the sender.
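The cost-sensitive decision can be sketched as below, writing λ for the cost ratio between the two error types and t = λ / (1 + λ) for the matching probability threshold. This is an illustrative sketch only: the Laplace smoothing, the function names, and the toy training data are assumptions, and the training set must contain both the "spam" and "legitimate" labels.

```python
import math

def threshold(lam):
    """Probability threshold t that corresponds to cost ratio lam."""
    return lam / (1 + lam)

def train(vectors, labels):
    """Estimate P(C) and P(X_i = 1 | C) per class; Laplace smoothing is an assumption."""
    model = {}
    for c in set(labels):
        rows = [v for v, l in zip(vectors, labels) if l == c]
        prior = len(rows) / len(labels)
        p_one = [(sum(col) + 1) / (len(rows) + 2) for col in zip(*rows)]
        model[c] = (prior, p_one)
    return model

def spam_probability(model, x):
    """P(C = spam | X = x) under the naive independence assumption."""
    log_joint = {}
    for c, (prior, p_one) in model.items():
        lp = math.log(prior)
        for xi, p in zip(x, p_one):
            lp += math.log(p if xi else 1 - p)
        log_joint[c] = lp
    # Normalize in log space to avoid underflow with many attributes.
    m = max(log_joint.values())
    total = sum(math.exp(v - m) for v in log_joint.values())
    return math.exp(log_joint["spam"] - m) / total

def is_spam(model, x, lam):
    """Block the message only if the spam probability exceeds t = lam / (1 + lam)."""
    return spam_probability(model, x) > threshold(lam)

# Toy training data (hypothetical): two binary attributes per message.
vectors = [[1, 1], [1, 0], [0, 0], [0, 1]]
labels = ["spam", "spam", "legitimate", "legitimate"]
model = train(vectors, labels)
print(round(spam_probability(model, [1, 1]), 2))  # 0.75
print(is_spam(model, [1, 1], 1))                  # True
print(is_spam(model, [1, 1], 9))                  # False
```

The same message is blocked at λ = 1 but passes at λ = 9 and λ = 999, which illustrates how a higher cost for blocking legitimate mail makes the filter more conservative.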

Result

Experiments of Sahami et al.: 1789 messages, consisting of 211 legitimate messages that users had saved and 1578 spam messages.

First experiment: only word attributes were used.

Second experiment: candidate attributes corresponding to phrases were added (e.g. "be over 21", "only $").

Third experiment: non-textual candidate attributes were added (e.g. whether or not the message contains attachments, or a high proportion of non-alphanumeric characters).

Experiments with the PU1 corpus

481 spam messages and 618 legitimate messages.

Naive Bayesian classifier, evaluated with ten-fold cross-validation to reduce random variation; results were then averaged over the ten runs.

The number of retained attributes was varied from 50 to 700 in steps of 50.

Preprocessing: lemmatizer and stop-list.
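The evaluation protocol (ten folds, results averaged over the ten runs, attribute counts from 50 to 700 in steps of 50) might be organized as in the sketch below. The shuffling with a fixed seed, the use of plain accuracy as the score, and the names `k_fold_indices`, `cross_validate`, `train_fn` and `predict_fn` are all assumptions for illustration; the slides do not say how the folds were built or which measure was averaged.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Split the indices 0..n-1 into k disjoint folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(vectors, labels, train_fn, predict_fn, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and average accuracy over the k runs."""
    folds = k_fold_indices(len(vectors), k, seed)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_x = [v for i, v in enumerate(vectors) if i not in held_out]
        train_y = [l for i, l in enumerate(labels) if i not in held_out]
        model = train_fn(train_x, train_y)
        correct = sum(predict_fn(model, vectors[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

# Numbers of retained attributes to evaluate: 50, 100, ..., 700.
attribute_counts = list(range(50, 701, 50))
```

One cross-validation run would then be repeated for each value in `attribute_counts`, with attribute selection redone on the training folds of every run.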
