
arXiv:cs/0503030v1 [cs.AI] 14 Mar 2005

A Suffix Tree Approach to Text Categorisation Applied to Spam Filtering

Rajesh M. Pampapathi, Boris Mirkin, Mark Levene

{rajesh, mirkin, mark}@dcs.bbk.ac.uk

February 8, 2020

Abstract

We present an approach to textual classification based on the suffix tree data structure and

apply it to spam filtering. A method for scoring of documents using the suffix tree is developed

and a number of scoring and score normalisation functions are tested. Our results show that the

character level representation of documents and classes facilitated by the suffix tree significantly

improves classification accuracy when compared with the currently popular naive Bayesian filtering method.

1 Introduction

Just as email traffic has increased over the years since its inception, so has the proportion that is

unsolicited; some estimations have placed the proportion as high as 60%, and the average cost

of this to business at around $2000 per year, per employee (see [19] for a range of numbers and

statistics on spam). Unsolicited emails, commonly known as spam, have become a daily feature

of every email user's inbox. Regardless of advances in email filtering, spam continues to be

a problem in a similar way to computer viruses which constantly reemerge in new guises. This

leaves the research community with the task of continually investigating new approaches to sorting

the welcome emails (known as ham) from the unwelcome spam.

We present just such an approach to email classification and filtering based on a well studied

data structure, the suffix tree (see [10] for a brief introduction). The approach is similar to many

existing ones, such as those based on naive Bayes, in that it uses training examples to construct


profiles of the spam and ham classes, and then compares new examples to each profile to decide

its class, but it differs in the depth and extent of the comparisons made. For a good overview of a

number of text classification methods, including naive Bayes, see [17], [1].

Using a suffix tree, we are able to compare not only single words, as in most current approaches, but substrings of an arbitrary length. The approach is far more resource hungry than

simpler approaches, but is more effective. The literature on suffix trees deals extensively with

improving (reducing) their resource demands; we provide references for readers interested in the

performance of algorithms for tree construction and searching ([18], [5], [8]), but we do not address the issues ourselves in this paper.

We argue that comparisons of substrings (at the level of characters) have particular benefits

in the domain of spam classification because of the methods spammers use to evade filters. For

example, they may disguise the nature of their messages by interpolating them with meaningless

characters, thereby fooling filters based on keyword features into considering the words, sprinkled

with random characters, as completely new and unencountered. If we instead treat the words as

character strings, and not features in themselves, we are still able to recognise the substrings, even

if the words are broken.
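The contrast described above can be made concrete with a small sketch (ours, not the authors'; the tokeniser and n-gram helper below are hypothetical illustrations): a word-level feature set sees an obfuscated word as an entirely new token, while a character-level view still finds the incriminating substring.

```python
# Hypothetical illustration (not from the paper): word-level tokens fail to
# match an obfuscated spam word, while character-level substrings still do.

def word_tokens(text):
    """Split on whitespace, strip punctuation from token edges."""
    return [t.strip(".,!?") for t in text.split()]

def char_ngrams(text, n):
    """All character substrings of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

spam_word = "medications"
obfuscated = "medica.tions"

# A word-based feature set sees a brand-new token:
print(spam_word in word_tokens(obfuscated))        # False

# A character-level view still finds the tell-tale substring "medica":
print("medica" in char_ngrams(obfuscated, 6))      # True
```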

Section 2 looks at the different methods spammers use to evade detection, drawing attention to the sorts of spam which make it useful to consider character level features. Section 3 gives a

brief explanation of the naive Bayes method of text classification as an example of a conventional

approach. Section 4 briefly introduces suffix trees, with some definitions and notations which are

useful in the rest of the paper, before going on to explain how the suffix tree is used to classify

text and filter spam. Section 5 describes our experiments, the test parameters and details of the

data sets we used. Section 6 presents the results of the experiments and provides a comparison

with results in the literature. Section 7 concludes.

2 Types of Spam

Spam messages typically advertise a variety of products or services ranging from prescription

drugs or cosmetic surgery to sun glasses or holidays. But regardless of what is being advertised,

one can distinguish between types based on the methods used by the spammer to evade detection.

These methods have evolved with the filters which attempt to extirpate them, so there is a generational aspect to them, with later generations becoming gradually more common and earlier ones

fading out; as this happens, earlier generations of filters become less effective.

We present five examples of spam messages, the first of which illustrates undisguised spam

while the other four illustrate one or more methods of evasion. We are here only concerned to


highlight some features which will aid an understanding of why a character level consideration of

the content of the message helps to identify its nature. For a fuller and more general discussion

and characterisation of spam, see [4], [6].

1. Undisguised message.

Buy cheap medications online, no prescription needed.
We have Viagra, Pherentermine, Levitra, Soma, Ambien, Tramadol and many more products.
No embarrasing trips to the doctor, get it delivered directly to your door.

Experienced reliable service.

Most trusted name brands.

Your solution is here: http://www.webrx-doctor.com/?rid=1000

The example contains no obfuscation. The content of the message is easily identified by

filters, and words like ”Viagra” allow it to be recognised as spam. Such messages are very

likely to be caught by the simplest word-based Bayesian classifiers.

2. Intra-word characters.

Get the low.est pri.ce for gen.eric medica.tions!
Xa.n.ax - only $100
Vi.cod.in - only $99
Ci.al.is - only $2 per do.se
Le.vit.ra - only $73
Li.pit.or - only $99
Pr.opec.ia - only $79
Vi.agr.a - only $4 per do.se
Zo.co.r - only $99

Your Sav.ings 40% compared Average Internet Pr.ice!

No Consult.ation Fe.es! No Pr.ior Prescrip.tions Required! No Appoi.ntments!
No Wait.ing Room! No Embarra.ssment! Private and Confid.ential! Disc.reet Packa.ging!

che ck no w <http://priorlearndiplomas.com/r3/?d=getanon>


The example above shows the use of intra-word characters, which may be non-alphanumeric or whitespace. Here the word "Viagra" has become "Vi.agr.a", while the word

”medications” has become ”medica.tions”. To a simple word-based Bayesian classifier,

these are completely new words, which might have occurred rarely, or not at all, in previous

examples. Obviously, there are a large number of variations on this theme which would

each time create an effectively new word which would not be recognised as spam content.

However, if we approach this email at the character level, we can still recognise strings such

as ”medica” as indicative of spam, regardless of the character that follows, and furthermore,

though we do not deal with this in the current paper, we might implement a look-ahead

window which attempts to skip (for example) non-alphabetic characters when searching for

spammy features.

3. Word salad.

Buy meds online and get it shipped to your door Find out more here <http://www.gowebrx.com/?rid=1001>

a publications website accepted definition. known are can Commons the be definition. Commons UK great public principal work Pre-Budget but an can Majesty's many contains statements statements titles (eg includes have website. health, these Committee Select undertaken described may publications

The example shows the use of what is sometimes called a word salad, meaning a random

selection of words. The first two lines of the message are its real content; the paragraph

below is a paragraph of words taken randomly from what might have been a government

budget report. The idea is that these words are likely to occur in ham, and would lead

to a traditional algorithm classifying this email as such. Again, approaching this at the

character level can help: say we consider strings of length 8; strings such as "are can" and "an can" are unlikely to occur in ham, but the words "an", "are" and "can" may occur quite

frequently. Of course, in most ’bag-of-words’ implementations, words such as these are

pruned from the feature set, but the argument still holds for other bigrams.

4. Embedded message (also contains a word/letter salad).

The example below shows an embedded message. Inspection of it will reveal that it is

actually offering prescription drugs. However, there are no easily recognised words, except

those that form the word salad, this time taken from what appear to be dictionary entries


under ’z’. The value of substring searching is highly apparent in this case as it allows us

to recognise words such as ”approved”, ”Viagra” and ”Tablets”, which would otherwise be

lost among the characters pressed up against them.

zygotes zoogenous zoometric zygosphene zygotactic zygoid zucchettos zymolysis zoopathy zygophyllaceous zoophytologist zygomaticoauricular zoogeologist zymoid zoophytish zoospores zygomaticotemporal zoogonous zygotenes zoogony zymosis zuza zoomorphs zythum zoonitic zyzzyva zoophobes zygotactic zoogenous zombies zoogrpahy zoneless zoonic zoom zoosporic zoolatrous zoophilous zymotically zymosterol

FreeHYSHKRODMonthQGYIHOCSupply.IHJBUMDSTIPLIBJTJUBIYYXFN

* GetJIIXOLDViagraPWXJXFDUUTabletsNXZXVRCBX

<http://healthygrow.biz/index.php?id=2>

zonally zooidal zoospermia zoning zoonosology zooplankton zoochemical zoogloeal zoological zoologist zooid zoosphere zoochemical

& Safezoonal andNGASXHBPnatural

& TestedQLOLNYQandEAVMGFCapproved

zonelike zoophytes zoroastrians zonular zoogloeic zoris zygophore zoograft zoophiles zonulas zygotic zymograms zygotene zootomical zymes zoodendrium zygomata zoometries zoographist zygophoric zoosporangium zygotes zumatic zygomaticus zorillas zoocurrent zooxanthella zyzzyvas zoophobia zygodactylism zygotenes zoopathological noZFYFEPBmas <http://healthygrow.biz/remove.php>

5. HTML message.

The final example uses HTML, which serves both as a means of disguising the content from

an automatic filter and as a way of catching the attention of the reader. There are words,

such as "Drugs", which will very likely identify the email as spam, but there are also others

which might obscure or confuse a simple filter. Again, treating this as a string of characters

means we don’t need to do any HTML parsing, and indeed, the presence of the HTML

automatically (without any special treatment from us) weights this email towards spam, as

most ham will not contain such tagging.


<font size=0 color="#fffff0">abhorredleasehold backspace

pont trepidation although pidgin cession dovekie downgrade

acidic crucify commodity ship spindle gorse miscellaneous

definite bluster centennial simpleminded wad eigenspace

dock</font><center>Cheapest And Best Drugs On The

Internet<BR>Quick Ordering<BR>Fast And Free Shipping

- We Have THem ALL<BR>

<font color="#fffff0">GDCYKTVARHSILDYSUVA</font>

<BR><b>Order Now</b><BR>

<font color="#fffff0">OOFKDWWKBZHGIPXVCCJ</font>

<BR>Put this link into your AOL browser<BR>

<font color="#fffff0">JZKZMTXIMFOJYYCQYAMUT</font>

<BR><b> Overnight Deliver For --Xanax--<BR></b>

<BR><b>http://www.let3234drugs.biz/cf671</b><BR>

<font color="#fffff0">RHPTJXOMAVHRBVIDRQQP</font>

<BR></center><font size=0 color="#fffff0">

carabao offsaddle candidate quiver~gourd pillow clump

protect marcy flip asinine confer aleck~infectious

perpetuate fescue amply crossover ligament byron curtis

</font>

These examples are only a sample of all the types of spam that exist, and our categories are

only one of a number of possible ways to divide the types; for an excellent and often updated list

of examples and categories, see [6]. Our categories are generally acknowledged in the literature

(for example [14], [21]), but there is little direct treatment. Under the categories suggested in [21],

examples 2 and 5 would count as 'Tokenisation' and/or 'Obfuscation', while examples 3 and 4 would count as 'Statistical'.

We believe that these sorts of spam are more indicative of the kinds currently sent, and that they therefore deserve special attention; the conventional 'bag-of-words' approaches are not the most appropriate for the classification of such text. We look next at a bag-of-words approach, naive Bayes, before

considering the suffix tree approach.

3 Naive Bayesian Classification

Naive Bayesian (NB) email filters currently attract a lot of research and commercial interest, and

have proved highly successful at the task; [16] and [14] are both excellent studies of this approach


to email filtering. We do not give detailed attention to NB as it is not the intended focus of this

paper. For a general discussion of NB see [9], and for more context in text categorisation and

further references, the reader is directed to [17].

Due to the popularity and established success of NB, we decided to use it as a comparison

to the performance of the suffix tree approach. Our implementation of NB does not use all the

optimisation techniques which are covered in the literature – for example, we use no specially

crafted rules to improve performance – but we do some standard pre-processing, including stop-word removal and word stemming. And as we actually use no pre-processing at all in our suffix

tree approach (see the next section), we considered it safe to make comparisons between the two

methods where possible. In any case, the main intention was to use NB as a benchmark when

comparing the change in performance that occurs with different corpora and mixes of spam to

ham in our investigations of the behaviour of our suffix tree approach.

A naive Bayes classifier begins with a set of training examples with each example document already assigned to one of a fixed set of possible classes, C = {c_1, c_2, c_3, ..., c_J}. The naive Bayes classifier then works by calculating the probability of each class given the document's features, and assigning new documents to the class which exhibits the highest probability. Each text is represented as a vector of word frequencies, d, with all stop-words, and the most and least frequently occurring words, removed. We use word frequencies as estimates of their probabilities and use Bayes' theorem to estimate the probability of a class, c_j, given document d:

P(c_j | d) = \frac{P(c_j) P(d | c_j)}{P(d)}    (1)

Assuming that words are independent, this leads to:

P(c_j | d) = \frac{P(c_j) \prod_{i=1}^{M} P(d_i | c_j)}{P(d)}    (2)

We estimate P(c_j) as:

P(C = c_j) = \frac{N_j}{N}    (3)

and P(d_i | c_j) as:

P(d_i | c_j) = \frac{1 + N_{ij}}{M + \sum_{k=1}^{M} N_{kj}}    (4)

where N_{ij} is the number of times word i occurs in class j (similarly for N_{kj}), and M is the total number of words considered, and so is also the size of the document vectors, d.

To classify a document we calculate two scores, for spam and ham, and take the ratio, hsr = hamScore/spamScore, and classify the document as ham if it is above a threshold, th, and as spam if it is below (see section 5.1.3). An 'uncertain' category is sometimes used, but for reasons

to come, we did not feel it was necessary in our final experiments (see section 6.4).
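As a rough illustration of the scoring just described, the following is a minimal sketch (our own naming and toy data, not the authors' implementation, and omitting their stop-word removal and stemming): priors follow equation (3), word likelihoods are Laplace-smoothed as in equation (4), and scoring works in log-space, so hsr > 0 plays the role of hamScore/spamScore exceeding a threshold of 1.

```python
import math
from collections import Counter

# A minimal naive Bayes sketch (our own naming, toy data): priors as
# N_j / N (eq. 3), likelihoods smoothed as (1 + N_ij) / (M + sum_k N_kj)
# (eq. 4), and classification via the log of hsr = hamScore / spamScore.

def train(docs_by_class):
    """docs_by_class: {class_name: [list of token lists]}."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    n_total = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        model[c] = (len(docs) / n_total, counts, sum(counts.values()))
    return model, len(vocab)

def log_score(model, m, c, doc):
    prior, counts, total = model[c]
    s = math.log(prior)
    for w in doc:
        s += math.log((1 + counts[w]) / (m + total))  # eq. 4, smoothed
    return s

ham_docs = [["meeting", "agenda", "report"], ["budget", "report"]]
spam_docs = [["viagra", "cheap", "cheap"], ["cheap", "meds", "online"]]
model, m = train({"ham": ham_docs, "spam": spam_docs})

doc = ["cheap", "viagra", "online"]
hsr = log_score(model, m, "ham", doc) - log_score(model, m, "spam", doc)
# In logs, hsr > 0 corresponds to hamScore/spamScore > 1.
print("ham" if hsr > 0 else "spam")   # spam
```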

4 Suffix Tree Classification

4.1 Introduction

The suffix tree is a data storage and fast search technique which is commonly used in fields such

as computational biology for applications such as string matching applied to DNA sequences [3],

[11]. To our knowledge it has never been used in the domain of natural language text classification.

We adopted a conventional procedure for using a suffix tree in text classification. As with NB, we take a set of documents D which are each known to belong to one class, c_j, in a set of classes, C, and build one tree for each class. Each of these trees is then said to represent (or profile) a class (a tree built from a class will be referred to as a "class tree").

When we have a new document d_N, we score it with respect to each of the class trees; the class of the highest scoring tree is taken as the class of the document.

Thus, the major challenge to be addressed is the development of adequate and appropriate

methods for the scoring of strings against class trees. But first, we consider the construction of

the class tree.

4.2 Suffix Tree Construction

We provide a brief introduction to suffix tree construction. The method we describe is naive and straightforward, in order to allow the basic concept to be more easily grasped. For a more detailed treatment, along with algorithms to improve computational efficiency, the reader is directed to [7]. Our representation of a suffix tree differs from the literature in two ways that are specific to our task: first, we label nodes and not edges, and second, we do not use a special terminal character. The former has little impact on the theory and allows us to associate frequencies directly with characters and substrings. The latter is simply because our interest is actually focused on substrings rather than suffixes; the inclusion of a terminal character would therefore not aid our algorithms, and its absence does not hinder them. Furthermore, our trees are depth limited, and so the inclusion of a terminal character would be meaningless in most situations.

Suppose we want to construct a suffix tree from the string, s = ”MEET”. The string has four

suffixes: s1 = ”MEET”, s2 = ”EET”, s3 = ”ET”, and s4 = ”T”.

We begin at the root of the tree and create a child node for the first character of the suffix

s1. We then move down the tree to the newly created node and create a new child for the next



Figure 1: A Suffix Tree after the insertion of ”MEET”.

character in the suffix, repeating this process for each of the characters in this suffix. We then take

the next suffix, s2, and, starting at the root, repeat the process as with the previous suffix. At any

node, we only create a new child node if none of the existing children represent the character we

are concerned with at that point. When we have entered each of the suffixes, the resulting tree

looks like that in Figure 1. Each node is labelled with the character it represents and its frequency.

The node’s position also represents the position of the character in the suffix, such that we can

have several nodes labelled with the same character, but each child of each node (including the

root) will carry a character label which is unique among its siblings. If we then enter the string, t =

"FEET", into the tree in Figure 1, we obtain the tree in Figure 2. The new tree is almost identical

in structure to the previous one because the suffixes of the two strings are all the same but for t1 =

”FEET”, and as we said before, we need only create a new node when an appropriate node does

not already exist, otherwise, we need only increment the frequency count.

Thus, as we continue to add more strings to the tree, the number of nodes in the tree increases

only if the new string contains substrings which have not previously been encountered. It follows

that given a fixed alphabet and a limit to the length of substrings we consider, there is a limit to the

size of the tree. Practically, we would expect that, for most classes, as we continue to add strings

to the class tree, the tree will increase in size at a decreasing rate, and will quite likely stabilise.



Figure 2: A Suffix Tree after insertion of strings MEET and FEET.
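The construction just described can be sketched as follows (a depth-limited implementation of our own, not the authors'): nodes are keyed by character, carry a frequency, and a new child is created only when no existing child represents the character.

```python
# A naive, depth-limited suffix tree construction in the spirit of the
# procedure described above (this sketch is ours, not the authors'
# implementation).

class Node:
    def __init__(self):
        self.freq = 0
        self.children = {}   # char -> Node

def insert(root, s, depth_limit=4):
    """Add every suffix of s to the tree, to at most depth_limit levels."""
    for i in range(len(s)):
        node = root
        for ch in s[i:i + depth_limit]:
            node = node.children.setdefault(ch, Node())
            node.freq += 1

def count_nodes(node):
    return sum(1 + count_nodes(c) for c in node.children.values())

root = Node()
insert(root, "MEET")
insert(root, "FEET")
# Matches Figure 2: 13 distinct substrings, so 13 nodes...
print(count_nodes(root))                      # 13
# ...and the frequency of node (E|E) is 2, since "ee" occurs in both strings.
print(root.children["E"].children["E"].freq)  # 2
```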

4.3 Class Trees and their Characteristics

For any string s we designate the ith character of s either by s[i] or s_i; the suffix of s beginning at the ith character by s(i); and the substring from the ith to the jth character by s(i, j).

Any node, n, labelled with a character, c, is uniquely identified by the path from the root to n.

For example, consider the tree in Figure 2. There are several nodes labelled with a "T", but we can distinguish between node n = ("T" given "MEE") = (T|MEE) and p = ("T" given "EE") = (T|EE). These nodes are labelled n and p in Figure 2. We say that the path of n is P_n = "MEE", and the path of p is P_p = "EE"; furthermore, the frequency of n is 1, whereas the frequency of p is 2; and saying n has a frequency of 1 is equivalent to saying the frequency of "T" given "MEE" is 1, and similarly for p.

If we say that the root node, r, is at level zero in the tree, then all the children of r are at level one. More generally, we can say that the level of any node in the tree is one plus the number of letters in its path. For example, level(n) = 4 and level(p) = 3.

The set of letters forming the first level of a tree is the alphabet - meaning that all the nodes of the tree are labelled with one of these letters. For example, considering again the tree in Figure 2, its first level letters are the set Σ = {m, e, t, f}, and all the nodes of the tree are labelled by one of these.

Suppose we consider a class, C, containing two strings (which we might consider as documents), s = "MEET" and t = "FEET". Then we can refer to the tree in Figure 2 as the class tree

of C, or the suffix tree profile of C, which we denote by T_C.

The size of the tree, |T_C|, is the number of nodes it has, and it has as many nodes as C has unique substrings. For instance, in the case of the tree in Figure 2:

U_C = uniqueSubstrings(C) = {meet, mee, me, m, eet, ee, e, et, t, feet, fee, fe, f}

|U_C| = |uniqueSubstrings(C)| = 13

|T_C| = numberOfNodes(T_C) = 13

The number of all occurrences of substrings in C, on the other hand, might be called the number of substring items (or tokens) in the class, C:

A_C = allSubstrings(C) = {meet, mee, me, m, eet, ee, e, et, e, t, feet, fee, fe, f, eet, ee, e, et, e, t}

As an example, note that the four "e"s in the set are in fact the substrings s(1,1), s(2,2), t(1,1) and t(2,2).

Furthermore, as each node in the tree, T_C, represents one of the substrings in U_C, the size of the class, A_C, is equal to the sum of the frequencies of nodes in the tree T_C:

|A_C| = |allSubstrings(C)| = sumOfFrequencies(T_C) = 20
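These counts can be checked directly from the definitions (a sketch of ours in plain Python; substrings are capped at length 4 to match the depth-limited tree):

```python
# Verify |U_C| = 13 and |A_C| = 20 for C = {"meet", "feet"} directly from
# the definitions of unique substrings and substring tokens.

def substrings(s, max_len=4):
    """Every substring of s up to max_len characters, with repetition."""
    return [s[i:j] for i in range(len(s))
                   for j in range(i + 1, min(i + max_len, len(s)) + 1)]

C = ["meet", "feet"]
A_C = [u for s in C for u in substrings(s)]   # all substring tokens
U_C = set(A_C)                                # unique substrings

print(len(U_C))   # 13
print(len(A_C))   # 20
```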

In a similar way, the suffix tree allows us to read off other frequencies very quickly and

easily. For example, if we want to know the number of characters in the classC, we can sum

the frequencies of the nodes on the first level of the tree; and if we want to know the number of

substrings of length 2, we can sum the frequencies of the level two nodes; and so on.

This also allows us to very easily estimate probabilities of substrings of any length (up to the depth of the tree), or of any nodes in the tree. For example, we can say from the tree in Figure 2, that the probability of a substring, u, of length two having the value u = "ee", given the class C, is the frequency (f) of the node n = (E|E), divided by the sum of the frequencies of all the level two nodes in the tree T_C:


estimatedProbabilityOfString(u) = \hat{p}_s(u) = \frac{f(u)}{\sum_{i \in N_u} f(i)}    (5)

where N_u is the set of all nodes at the same level as u.

Similarly, one can estimate the conditional probability of u as the frequency of u divided by the sum of the frequencies of all the children of u's parent:

estimatedProbabilityOfChar(u) = \hat{p}_c(u) = \frac{f(u)}{\sum_{i \in n_u} f(i)}    (6)

where n_u is the set of all children of u's parent.

Throughout this paper, whenever we mention \hat{p}(u), we mean the second of these (6): the conditional probability of a node u. Once the tree is built, one can calculate any such values on-the-fly.
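Both estimates can indeed be read off a frequency-labelled tree on-the-fly. The sketch below (our own encoding; the nested dicts hard-code the MEET/FEET tree of Figure 2) computes equation (5) for the substring "ee" and equation (6) for the node (E|E):

```python
from fractions import Fraction

# Probability estimates in the style of eqs. (5) and (6), from the MEET/FEET
# tree of Figure 2 hard-coded as {char: (frequency, children)} (our own
# encoding, not the authors' data structure).

tree = {
    "M": (1, {"E": (1, {"E": (1, {"T": (1, {})})})}),
    "F": (1, {"E": (1, {"E": (1, {"T": (1, {})})})}),
    "E": (4, {"E": (2, {"T": (2, {})}), "T": (2, {})}),
    "T": (2, {}),
}

def level_freqs(subtree, level, current=1):
    """Yield the frequencies of all nodes at the given level."""
    for freq, children in subtree.values():
        if current == level:
            yield freq
        else:
            yield from level_freqs(children, level, current + 1)

# eq. (5): p_s("ee") = f(E|E) / sum of frequencies of all level-two nodes
f_ee = tree["E"][1]["E"][0]
p_s = Fraction(f_ee, sum(level_freqs(tree, 2)))
print(p_s)   # 1/3

# eq. (6): p_c(E|E) = f(E|E) / sum over the children of (E|E)'s parent
parent_children = tree["E"][1]
p_c = Fraction(f_ee, sum(f for f, _ in parent_children.values()))
print(p_c)   # 1/2
```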

4.4 Classification using Suffix Trees

4.4.1 Criterion of Classification

Our basic approach to classification using a suffix tree is similar to most conventional approaches. First, we create a tree for each class using a training set. We then compare new examples - which we wish to classify - to each of the class trees. The example is scored with respect to each class and is assigned to the class against which it gains the highest score.

In practice, particularly in the current domain of spam filtering, in which we have just two

classes (spam and ham), and just as we do in the naive Bayes approach, we calculate two scores and take the ratio, hsr = hamScore/spamScore. The example is then classified as ham if the ratio is above a certain threshold, th, and as spam if it is below. This threshold (see section 5.1.3) allows us to bias the classifier in favour of ham (to avoid false positives) or spam (to avoid false negatives). Note that taking the ratio as above, rather than its inverse, actually already biases the

classifier in favour of ham.

4.4.2 Scoring

In contrast to the conventional ’bag of words’ approach employed in both naive Bayes and linear

classifiers, the suffix tree approach involves much richer structures, with a document represented

as strings, rather than independent words, and a class represented by a tree. These structures need to be taken into account when scoring a match between a document and a class.


In this section, we first describe how to score a match between a string and a class, then extend this to encompass document scoring, and finally show how we can take into account the different characteristics of the class and its tree, T, such as class size and tree density, density(T_C), which we define as the mean number of children over internal nodes, excluding the root and excluding leaf nodes (which would anyway always have zero children). To calculate it, we sum the number of children of each internal node and divide by the total number of internal nodes.
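The density measure just defined can be sketched as follows (our own naming; the nested-dict tree hard-codes the shape of the MEET/FEET tree of Figure 2, frequencies omitted):

```python
# density(T_C): mean number of children over internal nodes, excluding the
# root and excluding leaves (a sketch of the definition given above).

def density(children_of_root):
    """children_of_root: {char: dict} nested tree of children."""
    internal, total_children = 0, 0
    stack = list(children_of_root.values())
    while stack:
        kids = stack.pop()
        if kids:                      # internal node (non-leaf, non-root)
            internal += 1
            total_children += len(kids)
            stack.extend(kids.values())
    return total_children / internal if internal else 0.0

# The MEET/FEET tree of Figure 2 as nested dicts (frequencies omitted):
tree = {
    "M": {"E": {"E": {"T": {}}}},
    "F": {"E": {"E": {"T": {}}}},
    "E": {"E": {"T": {}}, "T": {}},
    "T": {},
}
print(density(tree))   # 9 children over 8 internal nodes = 1.125
```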

1. Scoring a match

(a) A string s has a match m = m(s, T) in a tree T if there exists in T a path P = m, where m is a prefix of s (note that a path is always considered to begin at the root).

(b) The score, score(m), for a match m = m_0 m_1 m_2 ... m_n, has two parts: firstly, the scoring of each character, m_i, with respect to its conditional probability, using a significance function of probability, φ[p] (see also part (1c)); and secondly, the adjustment (normalisation), ν(m|T), of the score for the whole match with respect to its probability in the tree:

score(m) = \nu(m | T) \sum_{i=0}^{n} φ[p(m_i)]    (7)

(c) A function of probability, φ[p], is employed as a significance function because it is not always the most frequently occurring terms or strings which are most indicative of a class. For example, this is the reason that conventional pre-processing removes all stop words, and the most and least frequently occurring terms; however, by removing them completely we give them no significance at all, when we might instead include them, but reduce their significance in the classification decision. Functions on the probability can help to do this, especially in the absence of all pre-processing, but that still leaves the question of how to weight the probabilities, the answer to which will depend on the class.

In the spam domain, some strings will occur very infrequently (consider some of

the strings resulting from intra-word characters in the examples of spam above) in

either the spam or ham classes, and it is because they are so infrequent that they are

indicative of spam. Therefore, under such an argument, rather than remove such terms

or strings, we should actually increase their weighting.

Considerations such as these led to experimentation with a number of specifications


of the significance function, φ[p]:

    φ[p] = 1                     (constant)
    φ[p] = p                     (linear)
    φ[p] = p^2                   (square)
    φ[p] = √p                    (root)
    φ[p] = ln(p) − ln(1 − p)     (logit)
    φ[p] = 1 / (1 + exp(−p))     (sigmoid)

The first three functions after the constant are variations of the linear (linear, sub-linear and super-linear). The last two are variations on the S-curve; we give above the simplest forms of the functions, but in fact they must be adjusted to fit in the range [0,1]. Together they cover some of the most popular functions in machine learning.

The sub- and super-linear functions decrease and increase, respectively, the sensitivity of the scoring function to changes in probabilities; and the S-curve functions shift the scoring sensitivity either to the edges (i.e. towards 0 and 1) or to the mid-range (i.e. around 0.5) probabilities: the logit reduces sensitivity at the mid-range, while the other two increase the sensitivity at the mid-range. For example, the gradient of the sigmoid is shallow at probabilities close to 0 and 1, so changes in the probabilities will result in a small change in the score, but at probabilities around 0.5 the gradient is steep, so a change in the probability will have a large effect on the score.
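The six specifications of φ[p], together with the per-character scoring of Eq. (7), can be sketched in Python as follows. This is a sketch only: the names are ours, and the logit and sigmoid are left in their simplest forms rather than rescaled to [0, 1] as the text requires.

```python
import math

# Sketch (names ours) of the significance functions phi[p] and the
# per-match scoring of Eq. (7).  As the text notes, in practice the
# logit and sigmoid must be adjusted to fit the range [0, 1]; here
# they are left in their simplest forms.
SIGNIFICANCE = {
    "constant": lambda p: 1.0,
    "linear":   lambda p: p,
    "square":   lambda p: p ** 2,
    "root":     lambda p: math.sqrt(p),
    "logit":    lambda p: math.log(p) - math.log(1.0 - p),
    "sigmoid":  lambda p: 1.0 / (1.0 + math.exp(-p)),
}

def score_match(probs, phi, nu=1.0):
    """Eq. (7): score(m) = nu(m|T) * sum_i phi[p(m_i)], given the
    conditional probability p(m_i) of each character of the match."""
    return nu * sum(phi(p) for p in probs)
```

For instance, `score_match([0.25, 0.25], SIGNIFICANCE["root"])` sums √0.25 twice, giving 1.0.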

(d) Having found a match, m, and scored each of its characters, one must consider the significance of finding m at all. With regard to this, there are perhaps two dimensions to the match m: firstly the characters of the match, and secondly its length. For example, one might argue that if the length of m is 5 and the class represented by the tree contains a great many strings of length 5, then finding a match of this length is of less significance than if there were only very few strings of this length in the class. Equally, if the tree contains every combination of the 5 characters in the match, then to find m in particular is of less significance than if m were the only combination of those characters which occurs. That is to say, for example, if a class tree, T, contains a path for every permutation of the characters in the string "abcde", it should not receive much of a reward for matching this particular combination. In such a case, even though the probability (and therefore the reward) will be quite small, there is a reward nevertheless. So, for example, a class which contains a large number of random strings will accumulate some score on any string it is presented with, and we must adjust the score accordingly.

Such considerations regarding the diversity of T, reflected by the structure of its paths, led to the formulation of the following values for ν(m|T):

    ν(m|T) =  1                                   match unnormalised
              f(m|T) / Σ_{i∈(m*|T)} f(i)          match normalised
              f(m|T) / Σ_{i∈(m′|T)} f(i)          match length normalised

where m* is the set of all the strings formed by the permutations of the letters in m, and m′ is the set of all strings of length equal to the length of m.
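The three ν(m|T) variants can be sketched as follows, with a plain dictionary of string frequencies standing in for the tree T. The `freq` table and function names are illustrative; how f(m|T) is read off the suffix tree itself is not shown here.

```python
from itertools import permutations

def nu(match, freq, mode="unnormalised"):
    """Sketch of the three nu(m|T) variants.  `freq` stands in for the
    tree T: a dict mapping each string with a path in T to its
    frequency f(.|T)."""
    if mode == "unnormalised":
        return 1.0
    if mode == "match":
        # m*: all strings formed by permuting the letters of the match
        perms = {"".join(p) for p in permutations(match)}
        denom = sum(freq.get(s, 0) for s in perms)
    elif mode == "length":
        # m': all strings in the tree of the same length as the match
        denom = sum(f for s, f in freq.items() if len(s) == len(match))
    else:
        raise ValueError(mode)
    return freq.get(match, 0) / denom if denom else 0.0
```

With `freq = {"ab": 2, "ba": 2, "cd": 4}`, match normalisation gives 2/(2+2) for "ab", while match length normalisation gives 2/8, since all three strings have length 2.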

2. Scoring a document

(a) To score a document one may consider it as a single string and score each successive match using the scoring procedure described above. However, this would, firstly, run the risk of missing longer matches in favour of shorter ones, if, for example, a longer match could have been made starting from one of the letters of an earlier shorter match, and, secondly, not sufficiently reward the longer matches. For instance, a match of 5 letters would be scored greater than a match of 4 letters by simply adding the score associated with the fifth character, say by a score of 5 rather than 4 as in the case of scoring by unity (see part 1c above). But in our view, the scoring should grow faster than just linearly with the growth of the match length. This is why we adopt the view that a document should be scored by the sum of the matches of all its suffixes. In this way, the reward may grow faster as a result of two causes: (1) the sheer number of suffixes in a string s leading to a quadratic rather than linear growth of the score, and (2) rewarding all matches, both short and long, of all suffixes.

Thus the score for a document s is the sum:

    SCORE(s, T) = (1/τ) Σ_{i=0}^{n} score(s(i), T)                    (8)

where τ is a normalisation coefficient defined below in part 2b. When we score a document against a tree in this way, we are in effect scoring each substring which the document and the tree have in common. Thus, if our significance function is unity, the score for a document would be exactly the number of substrings it has in common with the tree, and thereby, the class.
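Equation (8) can be sketched directly. Here `tree_score` stands in for the per-suffix match scoring score(·, T) of part 1, and τ defaults to the unnormalised case; both names are ours.

```python
def score_document(s, tree_score, tau=1.0):
    """Eq. (8): score a document s as the (tau-normalised) sum of the
    match scores of all of its suffixes s(i).  `tree_score` stands in
    for score(., T): it maps a suffix to the score of its match in the
    class tree."""
    return sum(tree_score(s[i:]) for i in range(len(s))) / tau
```

For example, with the stand-in `tree_score = len` (every suffix fully matched, unity significance), a document of length n scores n + (n−1) + ... + 1, illustrating the quadratic growth mentioned above.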

(b) The final aspect of scoring is what can be referred to as tree-level normalisation (τ), which takes account of differences in the size or the complexity of the class. The considerations here are similar to those of match normalisation above, but are at the level of the whole tree, rather than a specific match in the tree. The inclusion of such a normalisation coefficient was also motivated to some extent by our early observation that relative differences in the sizes of the classes could have quite a significant effect on performance (see Section 5.1.1).

We experimented with the following values for τ:

    τ =  1                          unnormalised
         size(T)                    size normalised
         density(T)                 density normalised
         log TotalFreq(T)           log total frequency normalised
         avFreq(T)                  average frequency normalised
         avFirstLevelFreq(T)        average first level frequency (AL1F) normalised

where the size, density and frequency measures are defined above. The total frequency is logged so that the normalisation coefficient increases with the addition of new strings (documents) into the tree in the same way as all the other coefficients, which naturally grow logarithmically.

5 Experimental Setup

All experiments were conducted under ten-fold cross validation. We accept the point made by [13] that such a method does not reflect the way classifiers are used in practice, but the method is widely used and serves as a thorough initial test of new approaches.

We follow convention by considering as true positives those true spam which are classified as spam; false positives are then true ham classified as spam; false negatives are spam classified as ham; and true negatives are ham classified as ham. See Section 5.3 for more on the performance measurements we use.
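The ten-fold procedure can be sketched as follows; this is an illustrative split, not the authors' implementation.

```python
def ten_fold(items, k=10):
    """Sketch of k-fold cross validation: yield k (train, test) splits
    in which each item appears in exactly one test set."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```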

5.1 Experimental Parameters

5.1.1 Spam to Ham Ratios

From some initial tests we found that success was to some extent contingent on the proportion of spam to ham in our data set – a point which is identified, but not systematically investigated, in other work [13] – and this therefore became part of our investigation. The differing results further prompted us to introduce a form of normalisation, even though we had initially expected the probabilities to take care of differences in the scale and mix of the data. Our experiments used three different ratios of spam to ham: 1:1, 4:6, and 1:5. The first and second of these (1:1 and 4:6) were chosen to reflect some of the estimates made in the literature of the actual proportions of spam in current global email traffic. The last of these (1:5) was chosen as the minimum proportion of spam included in experiments detailed in the literature, for example in [2].

5.1.2 Tree Depth

It is too computationally expensive to build trees as deep as emails are long. Furthermore, the marginal performance gain from increasing the depth of a tree, and therefore the length of the substrings we consider, may be negative. Certainly, our experiments show a diminishing marginal improvement (see Section 6.1.2), which would suggest a maximal performance level, which may not have been reached by any of our trials. We experimented with substrings of length 2, 4, 6, and 8.
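A depth-limited tree of this kind can be sketched as nested dictionaries. Strictly speaking this builds a suffix trie (uncompressed), which is enough for keeping path frequencies at small depths; the representation is ours, not the paper's.

```python
def build_tree(docs, depth):
    """Sketch: insert every suffix of every training document,
    truncated to `depth` characters, keeping at each node the
    frequency of the path that reaches it."""
    root = {}
    for doc in docs:
        for i in range(len(doc)):
            node = root
            for ch in doc[i:i + depth]:
                entry = node.setdefault(ch, [0, {}])  # [frequency, children]
                entry[0] += 1
                node = entry[1]
    return root
```

The number of nodes is bounded by the depth times the total suffix count, which is what makes small depths (2 to 8) tractable where full-length suffixes would not be.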

5.1.3 Threshold

From initial trials, we observed that the choice of threshold value in the classification criterion can have a significant, and even critical, effect on performance, and so we introduced it as an important experimental parameter. We used a range of threshold values between 0.7 and 1.3, in increments of 0.1, with a view to probing the behaviour of the scoring system.
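The classification criterion itself is not spelled out in this excerpt; one common form, assumed here purely for illustration, compares the ratio of the two class scores against the threshold th.

```python
def classify(score_spam, score_ham, th=1.0):
    """Assumed criterion (not confirmed by this excerpt): label a
    message spam when its spam-tree score exceeds th times its
    ham-tree score.  Raising th trades false positives (ham scored
    as spam) for false negatives."""
    return "spam" if score_spam > th * score_ham else "ham"
```

Under this reading, thresholds in the range 0.7 to 1.3 bias the decision modestly towards one class or the other, consistent with the optimal values reported near 1.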

5.2 Data

Three corpora were used to create the training and testing sets:

1. The Ling-Spam corpus (LS)

This is available from: http://www.aueb.gr/users/ion/data/lingspam_public.tar.gz. The corpus is that used in [2]. The spam messages of the corpus were collected by the authors from emails they received. The ham messages are taken from postings on a public online linguist bulletin board for professionals; the list was moderated, so does not have any spam. Such a source may at first seem biased, but the authors claim that this is not the case. There are a total of 481 spam messages and 2412 ham messages, with each message consisting of a subject and body. We do not use all these messages because we experimented with smaller data sets (see Table 1 below); we use instead a randomly selected subset of all the messages available.

2. Spam Assassin public corpus (SA)

This is available from: http://spamassassin.org/publiccorpus. The corpus was collected from direct donations and from public forums over two periods in 2002 and 2003, of which we use only the latter. The messages from 2003 comprise a total of 6053 messages, approximately 31% of which are spam. The ham messages are split into 'easy ham' (SAe) and 'hard ham' (SAh), the former being again split into two groups (SAe-G1 and SAe-G2); the spam is similarly split into two groups (SAs-G1 and SAs-G2), but there is no distinction between hard and easy. The compilers of the corpus describe hard ham as being closer in many respects to typical spam: use of HTML, unusual HTML markup, coloured text, "spammish-sounding" phrases, etc.

In our experiments we use ham from the hard group and the second easy group (SAe-G2); for spam we use only examples from the second group (SAs-G2). Of the hard ham there are only 251 emails, but for some of our experiments we required more examples, so whenever necessary we padded out the set with randomly selected examples from group G2 of the easy ham (SAe-G2); see Table 1. The SA corpus reproduces all header information in full, but for our purposes we extracted the subjects and bodies of each; the versions we used are available at: http://dcs.bbk.ac.uk/~rajesh/spamcorpora/spamassassin03.zip

3. The BBKSpam04 corpus (BKS)

This is available at: http://dcs.bbk.ac.uk/~rajesh/spamcorpora/bbkspam04.zip. This corpus consists of the subjects and bodies of 600 spam messages received by the authors during 2004. The Birkbeck School of Computer Science and Information Systems uses Spam Assassin, so all the spam in this corpus has initially evaded that filter. The corpus is further filtered so that no two emails share more than half their substrings with others in the corpus. This corpus is important to our argument because it is largely populated with the kind of examples we draw attention to in Section 2. We believe that this corpus more accurately reflects the current level of evolution in spam messages, while our other two corpora contain a greater proportion of undisguised spam.

One email data set (EDS) consisted of a set of spam and a set of ham. Using messages from these three corpora, we created the EDSs shown in Table 1. The final two numbers in the code for each email data set indicate the mix of spam to ham; three mixes were used: 1:1, 4:6, and 1:5. The letters at the start of the code indicate the source corpus of the set's spam and ham, respectively; hence the grouping. For example, EDS SAe-46 is comprised of 400 spam mails taken from the group SAs-G2 and 600 ham mails from the group SAe-G2, and EDS BKS-SAeh-15 is comprised of 200 spam mails from the BKS data set and 1000 ham mails made up of 800 mails from the SAe-G2 group and 200 mails from the SAh group.

EDS Code      Spam Source (number)   Ham Source (number)

LS-11         LS (400)               LS (400)
LS-46         LS (400)               LS (600)
LS-15         LS (200)               LS (1000)

SAe-11        SAs-G2 (400)           SAe-G2 (400)
SAe-46        SAs-G2 (400)           SAe-G2 (600)
SAe-15        SAs-G2 (200)           SAe-G2 (1000)

SAeh-11       SAs-G2 (400)           SAe-G2 (200) + SAh (200)
SAeh-46       SAs-G2 (400)           SAe-G2 (400) + SAh (200)
SAeh-15       SAs-G2 (200)           SAe-G2 (800) + SAh (200)

BKS-LS-11     BKS (400)              LS (400)
BKS-LS-46     BKS (400)              LS (600)
BKS-LS-15     BKS (200)              LS (1000)

BKS-SAe-11    BKS (400)              SAe-G2 (400)
BKS-SAe-46    BKS (400)              SAe-G2 (600)
BKS-SAe-15    BKS (200)              SAe-G2 (1000)

BKS-SAeh-11   BKS (400)              SAe-G2 (200) + SAh (200)
BKS-SAeh-46   BKS (400)              SAe-G2 (400) + SAh (200)
BKS-SAeh-15   BKS (200)              SAe-G2 (800) + SAh (200)

Table 1: Composition of Email Data Sets (EDSs) used in the experiments.

5.2.1 Pre-processing

For the suffix tree classifier, no pre-processing is done.

For the naive Bayesian classifier, we use the following three conventional pre-processing procedures:

1. Remove all punctuation.

2. Remove all stop-words.

3. Stem all remaining words.

Words are taken as strings of characters separated from other strings by one or more whitespace characters (spaces, tabs, newlines). Punctuation is removed first in the hope that many of the intra-word characters which spammers use to confuse a Bayesian filter will be removed. Our stop-word list consisted of 57 of the most frequent prepositions, pronouns, articles and conjunctives. Stemming was done using an implementation of Porter's 1980 algorithm [15]. For more general information on these and other approaches to pre-processing, the reader is directed to [12].
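The three steps can be sketched as follows. The stop-word list here is a tiny illustrative stand-in for the 57-word list, and the one-line suffix-stripper merely stands in for Porter's algorithm; neither reproduces the authors' exact pipeline.

```python
import re

# Tiny illustrative stand-in for the 57-word stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "it"}

def preprocess(text):
    """Sketch of the three conventional NB pre-processing steps:
    strip punctuation, drop stop words, stem.  The final regex is a
    crude suffix-stripper standing in for Porter's algorithm."""
    text = re.sub(r"[^\w\s]", " ", text.lower())              # 1. punctuation
    words = [w for w in text.split() if w not in STOP_WORDS]  # 2. stop words
    return [re.sub(r"(ing|ed|s)$", "", w) for w in words]     # 3. naive stem
```

Removing punctuation first means that a disguised token such as "V.i.a.g.r.a" collapses back towards its undisguised form before tokenisation, which is exactly the hope expressed above.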

5.3 Performance Measurement

Following Sahami et. al. [16], Androutsopoulos et al. [2], and others, the measurement parame-

ters we began with wererecall andprecisionfor both spam and ham. For spam (and similarly for

ham) these measurements are defined as follows:

Spam Recall(SR) = NS→SNS→S+NS→H

, Spam Precision(SP) = NS→SNS→S+NH→S

,

whereNX→Y means the number of items of classX assigned to classY. Spam recall measures the

proportion of all spam messages which were identified as spamand spam precision measures the

proportion of all messages classified as spam which truly arespam; and similarly for ham.

For those who prefer to think in terms of true and false positives and negatives, Spam Precision is a reflection of the proportion of true positives (it is equivalent to the proportion of positives which were true) and Ham Precision is a reflection of the proportion of true negatives (it is equivalent to the proportion of negatives which were true). One minus each of these values gives us the proportion of false positives and false negatives respectively.

However, we eventually decided to consider only the spam and ham precision, as together these can represent the success we are looking for in the classifier. In our opinion, recall has less significance in this domain than in information retrieval, where it is more commonly used to measure the success of search engines. In the latter it is used to measure the number of relevant webpages returned as a proportion of all those pages which are in fact relevant. However, in our trials we know that we are classifying all the known examples, so a spam message which is not identified as such is not missed altogether, but is classified as ham. Thus, recall and precision are highly related, such that a lower spam recall will necessarily result in a lower ham precision, and vice versa.

Ideally we would like to maximise both spam and ham precision values, but the greater priority is apportioned to the former, as a misclassified ham message is worse than a misclassified spam message. A spam precision of 100% means that there were no false positives; however, it is possible to achieve this by simply classifying all messages as ham. We therefore also attempt to achieve the highest possible value for ham precision, for which 100% means that there were no false negatives. As the precisions are so high, it makes sense to consider the precision error (1 − precision) so that the scale is better in our graphs.

All this gives us our three measures of success: spam precision error (SPE), ham precision error (HPE) and the sum of the errors (SPE + HPE). We generally quote these values as percentages.

Finally, we define as 'optimal' the state in which the sum of the errors is minimised.
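These measures follow directly from the four counts N_{X→Y}; a sketch (function and key names are ours):

```python
def precision_errors(n_ss, n_sh, n_hs, n_hh):
    """Compute spam/ham recall and precision from the four counts
    N_{X->Y} (class X assigned to class Y: spam->spam, spam->ham,
    ham->spam, ham->ham), plus the paper's error measures
    SPE = 1 - SP, HPE = 1 - HP and their sum."""
    sr = n_ss / (n_ss + n_sh)   # spam recall
    sp = n_ss / (n_ss + n_hs)   # spam precision
    hp = n_hh / (n_hh + n_sh)   # ham precision
    return {"SR": sr, "SP": sp, "HP": hp,
            "SPE": 1 - sp, "HPE": 1 - hp,
            "sum_error": (1 - sp) + (1 - hp)}
```

Note how the two errors trade off through the shared counts: every spam classified as ham (n_sh) leaves spam precision untouched but lowers ham precision, which is the coupling between recall and precision described above.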

6 Results

We consider the results of suffix tree and naive Bayesian classification separately in Sections 6.1 and 6.2, before comparing them in Section 6.3. In Section 6.4, we compare our results to those of other researchers.

6.1 Suffix Tree

6.1.1 Effect of Significance Function

We found that all the significance functions we tested worked very well, and with each of them it was possible, by increasing the depth and moving the threshold up or down, to achieve spam and ham precision error levels both below 3% or 4%; with some functions on some data sets it was possible to reduce both to 0%. It was therefore not easy to select one function over another, as can be seen from Table 2, which shows the sum of error (SPE + HPE) values achieved on each email data set at a conventional threshold of 1 by each significance function under match normalisation, which proved to be the most successful method of normalisation (see Section 6.1.4). Indeed, we found that the choice of threshold could be more important than the choice of function (see Section 6.1.3).

From Table 2 we might select, as the most consistent performer, one out of the root, logit or constant functions. But even though these functions most frequently achieve the best performance levels, there is little separation between them and the other specifications, although we might eliminate the constant function on the grounds that when it does perform badly, as on the BKS-SAe and BKS-SAeh data sets, it trails the others by a relatively large margin.

The choice does not become much easier even if we consider the performance of each function at an individually optimal threshold (the threshold which minimises the sum of errors; see Section 5.3 for a definition), as can be seen from the results shown in Table 3.

However, the root function now looks marginally the best candidate, and so it was decided that this significance function should be used in our comparisons with the NB results in Section 6.3. Nevertheless, to give the reader a sense of the performance of the different functions, we give examples throughout Section 6.1 using a variety of them.

Sum of Error Values (%) at th = 1 for specifications of φ[p]

EDS Code      constant   linear   square   root   logit   sigmoid

LS-11         1.50       2.24     2.50     1.70   1.74    1.75
LS-46         1.58       1.89     2.61     1.41   2.13    1.89
LS-15         1.33       1.33     2.42     1.33   2.42    1.45

SAe-11        0.25       0.50     0.50     0.50   0.25    0.74
SAe-46        0.33       0.50     0.33     0.33   0.17    0.50
SAe-15        0.20       0.30     0.80     0.30   0.70    0.40

SAeh-11       7.00       7.00     7.50     6.75   6.49    6.50
SAeh-46       4.01       4.58     5.07     5.00   4.34    4.67
SAeh-15       3.40       6.37     6.50     5.02   4.51    4.74

BKS-LS-11     0          0        0        0      0       0
BKS-LS-46     0          0        0        0      0       0
BKS-LS-15     0          0.30     0.30     0.20   0       0.30

BKS-SAe-11    4.53       1.72     1.48     1.48   1.48    1.72
BKS-SAe-46    2.91       1.15     1.32     1.15   0.99    1.80
BKS-SAe-15    1.86       1.19     1.19     1.09   1.09    1.67

BKS-SAeh-11   8.47       5.44     6.76     4.76   5.44    6.76
BKS-SAeh-46   6.40       3.38     4.46     2.76   3.23    4.61
BKS-SAeh-15   3.01       1.86     1.86     1.86   1.86    2.82

Table 2: Sum of error values at a conventional threshold of 1 for all significance functions under match normalisation.

Sum of Error Values (%) at optimal th for specifications of φ[p]

EDS Code      constant   linear   square   root   logit   sigmoid

LS-11         1.23       0.99     0.99     0.99   0.99    0.99
LS-46         1.00       1.00     1.08     1.00   1.00    0.83
LS-15         0.55       0.44     0.55     0.44   0.55    0.66

SAe-11        0.25       0        0        0      0       0.25
SAe-46        0.33       0.33     0.33     0      0.33    0.33
SAe-15        0.20       0.30     0.40     0.30   0.30    0.30

SAeh-11       6.92       6.50     5.98     6.24   6.22    6.44
SAeh-46       4.01       4.32     4.58     3.92   4.33    4.42
SAeh-15       3.40       3.26     4.23     2.86   3.61    3.78

BKS-LS-11     0          0        0        0      0       0
BKS-LS-46     0          0        0        0      0       0
BKS-LS-15     0          0        0        0      0       0

BKS-SAe-11    0          0        0        0      0       0
BKS-SAe-46    0          0        0        0      0       0
BKS-SAe-15    0.70       0        0        0      0.10    0.10

BKS-SAeh-11   2.73       1.73     1.99     1.72   1.99    1.99
BKS-SAeh-46   1.50       0.99     0.99     1.15   1.08    0.99
BKS-SAeh-15   1.67       1.01     1.09     1.21   1.21    1.09

Table 3: Sum of error values at individual optimal thresholds for all significance functions under match normalisation.

Depth   Spam Precision Error (SPE) %   Ham Precision Error (HPE) %

2       5.35                           14.38
4       2.23                           1.26
6       0.75                           1.24
8       0.50                           1.00

Table 4: Classification errors by depth – φ[p] = root(p), match unnormalised, tree normalised with τ = AL1F, th = 1.0, EDS = LS-11.

6.1.2 Effect of Depth Variation

For illustrative purposes, Table 4 shows the results using the root function (φ[p] = root(p)), with no match normalisation (ν(m|T) = 1), and tree-level average first level frequency normalisation (τ = AL1F); the EDS used was LS-11. Depths of 2, 4, 6, and 8 are shown.

The table demonstrates a characteristic which is common to all considered combinations of significance and normalisation functions: performance improves as the depth increases. Therefore, in further examples, we consider only our maximum depth of 8. Notice also the decreasing marginal improvement as depth increases, which suggests that there may exist a maximal performance level.

6.1.3 Effect of Threshold Variation

We generally found that there was an optimal threshold (or range of thresholds) which maximised the success of the classifier (minimised the sum of the errors). As can be seen from the four example graphs shown in Figure 3, the optimal threshold varies depending on the functions used and the mix of ham and spam in the training and testing sets, but it tends to always be close to 1.

Obviously, it may not be possible to know the optimal threshold in advance, but we expect, though have not shown, that the optimal threshold can be established during a secondary stage of training where only examples with scores close to the threshold are used – similar to what Meyer and Whateley [13] call "non-edge training".

In any case, the main reason for using a threshold is to allow a potential user to decide the level of false positive (SPE) risk they are willing to take. Reducing the risk carries with it an inevitable rise in false negatives (HPE).

The shapes of the graphs are typical for all values of φ(p); the performance of a particular scoring configuration is reflected not only by the minimums achieved at optimal thresholds but also by the steepness (or shallowness) of the curves: the steeper they are, the more rapidly errors rise at sub-optimal levels, making it harder to achieve zero SPE without a considerable rise in HPE.

[Figure 3 appears here: four plots of precision error (%) against threshold (0.85 to 1.15), showing SPE and HPE curves for (a) φ(p) = root(p), unnormalised, s:h = 1:1; (b) φ(p) = sigmoid(p), unnormalised, s:h = 1:1; (c) φ(p) = square(p), unnormalised, s:h = 1:5; (d) φ(p) = logit(p), unnormalised, s:h = 4:6.]

Figure 3: Precision error levels for four values of φ(p) applied to the Ling-Spam data sets using no normalisation. Note that the asymmetry in graph (c) is due more to the mix of spam to ham than to the scoring configuration.


6.1.4 Effect of Normalisation

We found that there was a consistent advantage to using match normalisation, which is based on the permutations of the characters in the match we were scoring. The method did not limit by much the variation in the optimal threshold, but it did improve the overall performance of the classifier in that the error levels were reduced for all threshold values; that is to say, the error levels of non-optimal thresholds moved much closer to those of the optimal.

The effect can be clearly seen in Figure 4, which shows the effect of match normalisation with three different mixes of spam and ham, using a significance function φ(p) = 1. The graphs in the right column are shallower than those on the left, so that if we were to move the threshold up or down in order to achieve either a zero ham precision error (HPE) or spam precision error (SPE) respectively, there would be a smaller error penalty in the other precision.

This is the advantage of match normalisation: overall performance is improved, independent of threshold, and to some degree independent of the significance function we choose.

We observed that no other approaches to normalisation were consistently beneficial. Some approaches made no difference at all and others even made matters worse, sometimes drastically. Figure 5 shows the effects of three types of normalisation on a linear significance function, φ[p] = p, using a spam:ham mix of 1:1. Graph (a) shows the error levels with no normalisation; graph (b) shows that there is almost no change when we introduce tree size normalisation; graph (c) shows that average frequency normalisation does not change the shape of the graph but shifts it to the right; and graph (d) shows that the use of match length normalisation leads to the classification almost breaking down completely, with spam precision errors (SPE) remaining above 20% and therefore not even appearing on the graph.

Some methods, namely tree-level density, average frequency and average first level frequency normalisation, proved useful with some configurations of the classifier, but again nothing consistent.

6.2 Naive Bayes

Let us now consider the performance of the naive Bayesian (NB) classifier before comparing the two approaches in the next section. Figure 6 shows the ham and spam precision errors using the LS-based email data sets.

Overall, NB performed worse than the suffix tree (ST). Performance varies with the spam:ham mix, as does the optimal threshold, but not to the same extent.

The error increases far more rapidly as we move away from the optimal threshold. This is important because it makes it difficult to achieve a 0% (or near 0%) spam precision error while

[Figure 4 appears here: six plots of precision error (%) against threshold (0.7 to 1.3), showing SPE and HPE curves with φ(p) = 1; left column unnormalised, right column match normalised, for s:h = 1:1 (a, b), 4:6 (c, d) and 1:5 (e, f).]

Figure 4: Effect of match normalisation on the spam precision (SP) and ham precision (HP) errors for spam to ham ratios (s:h) of 1:1, 4:6, and 1:5, using Ling-Spam email data sets. The left column (graphs a, c and e) is unnormalised, while the right column (graphs b, d and f) is match normalised. The graphs show the effect when φ(p) = 1, but the effect is similar for all values of φ(p).

[Figure 5 appears here: four plots of precision error (%) against threshold, showing SP and HP error curves with φ(p) = p, s:h = 1:1, for (a) unnormalised, (b) size normalised, (c) average frequency normalised, and (d) match length normalised scoring.]

Figure 5: Effect of normalisation on precision errors using the Ling-Spam data set with a 1:1 mix of spam to ham and φ(p) = p.

[Figure 6 appears here: three plots of precision error (%) against threshold (0.85 to 1.15), showing SPE and HPE curves for s:h = 1:1, 4:6 and 1:5.]

Figure 6: Precision error levels for the naive Bayesian classifier using the Ling-Spam data sets and spam to ham mixes of 1:1 (graph a), 4:6 (graph b), and 1:5 (graph c).

maintaining a low ham precision error. This can be seen clearly by comparing graph (c) in Figure 6 with graph (c) in Figure 3, which both show results using a 1:5 mix of spam to ham. Clearly, the NB approach results in much higher error levels on either side of the optimal threshold.

There is some variation in the optimal threshold as the mix of spam to ham changes, but the value always stays close to 1, just as with the suffix tree approach.

6.3 Comparison of Suffix Tree and Naive Bayes

From Table 5, we can see that the two approaches have similar success at a threshold of 1 with the LS-based data sets, but in all other cases the ST performs better. The performance of the NB approach suffers most, as expected, when applied to data sets containing the disguised kinds of spam we introduced in Section 2, namely those using examples from BKS. Notice also that NB has a lot of trouble with SAeh, which contains ham with features resembling spam; a very high spam


                 Naive Bayes          Suffix Tree
EDS Code      SPE (%)  HPE (%)    SPE (%)  HPE (%)
LS-11          1.24     0.50       0.99     0.75
LS-46          1.00     0.38       1.24     0.17
LS-15          4.81     0.20       1.11     0.22
SAe-11         0        2.69       0        0.50
SAe-46         0.25     1.32       0        0.83
SAe-15         4.46     0.70       0        0.30
SAeh-11        9.63     1.65       3.49     3.26
SAeh-46        7.98     1.39       3.00     2.00
SAeh-15       18.06     1.43       3.63     1.39
BKS-LS-11      0       10.91       0        0
BKS-LS-46      0.29     8.41       0        0
BKS-LS-15      1.41     5.67       0        0.20
BKS-SAe-11     0        8.26       0        1.48
BKS-SAe-46     0        5.10       0        1.15
BKS-SAe-15     5.56     2.94       0        1.09
BKS-SAeh-11   14.22     0.60       0        4.76
BKS-SAeh-46   10.96     0.36       0        2.75
BKS-SAeh-15   30.00     1.18       0        1.86

Table 5: Classification errors at threshold (th) = 1, for naive Bayes (NB) and a suffix tree (ST) with φ(p) = root(p), match normalised, tree unnormalised. For the composition of each email data set (EDS) see Table 1.

precision error (SPE) indicates that many of these ham messages were incorrectly classified as spam. The ST also struggles with this email data set, but fares better, with SPE and HPE roughly equivalent; this suggests that the boundary between the classes has become blurred, which is to be expected, considering the nature of the ham. Overall, the ST is able to maintain a relatively high performance on all data sets.

Two unexpected results were the success of the Bayesian filter when dealing with the BKS-SAe email data sets and the improvement in performance of the suffix tree filter when moving from the SAeh to the BKS-SAeh data sets.

In the case of the former, we expected the NB filter to perform well when it dealt with vanilla spam combined with regular examples of ham (of the sort found in the SAe ham), but not so well


when dealing with data sets containing the more difficult kinds of spam found in the BKS corpus. The success might be explained by stage one in pre-processing: removal of punctuation. This stage would remove the intra-word characters that spammers often use, making it far more likely that words obfuscated in this way – using different characters, in different positions – would be converted to some common form which could then be recognised in new spam mail. However, we do not offer here a detailed analysis; see the future work section.
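The effect described above can be sketched as follows. This is an illustrative reconstruction of the punctuation-removal stage, not the paper's actual pre-processing code; the function name `strip_punctuation` is ours:

```python
import string

def strip_punctuation(text: str) -> str:
    # Stage-one-style pre-processing: delete every punctuation
    # character and lower-case, so that differently obfuscated
    # spellings of a word collapse to one common token.
    return text.translate(str.maketrans("", "", string.punctuation)).lower()

# Two different obfuscations of the same word map to the same form,
# which can then be recognised in new spam mail:
print(strip_punctuation("V.i.a.g.r.a"))   # viagra
print(strip_punctuation("V-i-a*g.r+a"))   # viagra
```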

In the case of the latter, we expected a greater degree of similarity between the spam and ham examples in the BKS-SAeh data sets than between those in the SAeh data sets, but this is clearly not the case. Although a slightly higher HPE on the second data set shows that more spam have been misclassified as ham (false negatives), we would have expected the same to happen the other way around (more false positives). To understand why this has happened, we would need to take a closer (comparative) look at the examples in the two data sets.

We now turn to the performance of each filter at its optimal threshold (recall that we define the optimal as that which minimises the sum of the errors, SPE and HPE). As we have not offered a means of establishing this optimal a priori, we might think of the results shown in Table 6 as theoretical, best-case, performance levels.
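The definition of the optimal threshold amounts to a simple search over the error curves. A minimal sketch, using made-up error values rather than figures from the paper:

```python
def optimal_threshold(thresholds, spe, hpe):
    # The optimal threshold is defined as the one minimising the sum
    # of the spam precision error (SPE) and ham precision error (HPE).
    best = min(range(len(thresholds)), key=lambda i: spe[i] + hpe[i])
    return thresholds[best], spe[best] + hpe[best]

# Illustrative error curves (percentages), not values from the tables:
ths = [0.94, 0.96, 0.98, 1.00, 1.02]
spe = [0.0, 0.0, 0.5, 1.2, 2.0]
hpe = [2.5, 1.0, 0.6, 0.4, 0.3]
th, total = optimal_threshold(ths, spe, hpe)
print(th, total)   # 0.96 1.0
```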

We can see that there is an improvement on the part of both filters, but again the ST is the more successful of the two.

Although the actual optimal threshold varies more for the ST than for the NB, with a range of [0.92, 1.22], compared to NB's [0.94, 1.06], the ST's optimal threshold is sometimes a range of values, which indicates very strong performance.

The worst results for the ST are, as before, for the SAeh and the BKS-SAeh data sets. The

NB filter shows a dramatic improvement in its handling of these same two data sets, but the sum

of errors is still close to 10%.

6.4 Comparison with results of other work

Table 7 shows the results reported by Androutsopoulos et al. in [2], where they used the LS data set. The results of their Bayesian filter are slightly different from our results, probably due to differences in pre-processing and feature selection. Androutsopoulos et al. also use much larger data sets, with 2412 ham messages and 481 spam messages, but the mix of spam to ham is 1:5, and therefore certainly comparable to our results using the same mix. Results reported in other references are harder to compare because the work involves parameters not replicated in our work, such as an 'uncertain' category, and consideration of the order in which emails arrive [21], [13]. We also experimented with an uncertain category in early trials, but decided not to use one in the end because our intention was always to study the properties of the ST classifier, and observing


                    Naive Bayes                      Suffix Tree
EDS Code      OptTh  SPE (%)  HPE (%)    OptTh        SPE (%)  HPE (%)
LS-11          1.00   1.24     0.50       0.96         0.00     0.99
LS-46          1.00   1.00     0.83       0.96         0.50     0.50
LS-15          0.96   0        1.48       0.92         0.00     0.44
SAe-11         1.06   0.25     0.00       1.10         0        0
SAe-46         1.04   0.50     0.17       1.06         0        0
SAe-15         0.98   1.61     1.68       0.98 - 1.00  0        0.30
SAeh-11        0.98   5.70     4.80       0.98         2.77     3.47
SAeh-46        0.98   4.11     4.42       0.98         1.77     2.15
SAeh-15        0.98   9.57     2.96       0.98         1.09     1.77
BKS-LS-11      1.04   0.76     2.22       0.84 - 1.14  0        0
BKS-LS-46      1.04   2.03     2.31       0.78 - 1.16  0        0
BKS-LS-15      1.00   1.41     5.67       1.02 - 1.22  0        0
BKS-SAe-11     1.06   0        0.25       1.04 - 1.28  0        0
BKS-SAe-46     1.04   0.50     0.33       1.18 - 1.28  0        0
BKS-SAe-15     0.98   2.03     5.23       1.22         0        0
BKS-SAeh-11    0.98   7.55     2.13       1.06         0        1.72
BKS-SAeh-46    0.98   5.87     2.54       1.06         0        1.15
BKS-SAeh-15    0.94   3.15     7.18       1.12         0.51     0.70

Table 6: Classification errors at optimal thresholds (where the sum of the errors is minimised) for naive Bayes (NB) and a suffix tree (ST) with φ(p) = root(p), match normalised and tree unnormalised.


Filter Configuration         th     No. of attrib.  Spam Recall  Spam Precision
(a) bare                     0.5      50             81.10%       96.85%
(b) stop-list                0.5      50             82.35%       97.13%
(c) lemmatizer               0.5     100             82.35%       99.02%
(d) lemmatizer + stop-list   0.5     100             82.78%       99.49%
(a) bare                     0.9     200             76.94%       99.46%
(b) stop-list                0.9     200             76.11%       99.47%
(c) lemmatizer               0.9     100             77.57%       99.45%
(d) lemmatizer + stop-list   0.9     100             78.41%       99.47%
(a) bare                     0.999   200             73.82%       99.43%
(b) stop-list                0.999   200             73.40%       99.43%
(c) lemmatizer               0.999   300             63.67%      100.00%
(d) lemmatizer + stop-list   0.999   300             63.05%      100.00%

Table 7: Results of Androutsopoulos et al. [2] on the Ling-Spam corpus. The 'Filter Configuration' column states the kind of pre-processing used on the data set: 'bare' indicates no pre-processing, 'stop-list' indicates that very common words such as pronouns, articles, prepositions etc. were removed, and 'lemmatizer' indicates that all words were reduced to their root; for more information on these pre-processing techniques, the reader is directed to [12], [20]. The column labelled 'No. of attrib.' indicates the number of word features which the authors retained as indicators of class - again a common technique which is explained in the references given. Androutsopoulos et al. do not actually quote the threshold, but a 'cost value', which we have converted into its threshold equivalent.

its behaviour at a range of thresholds was enough to do this.

Table 8 shows the suffix tree results using spam recall and spam precision as the performance measures so that they are comparable with those in Table 7. We report the results from all three mixes of spam to ham we used. We did not experiment with exactly the same thresholds, but we report those which are most comparable: the first three rows show results for a threshold of 0.9, the same as that used by Androutsopoulos et al. in [2]; the next three rows show results for a threshold of 1.00, which is the closest we have to Androutsopoulos et al.'s threshold of 0.999. The final three rows show the spam recall and precision values at the thresholds which are optimal for the suffix tree. Of course, as we have said earlier, the suffix tree approach involves no pre-processing of any kind. As can be seen by comparing the results presented in the two tables, the performance levels for precision are comparable, but the suffix tree simultaneously achieves much better results for recall.
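For reference, spam recall and spam precision are computed from the classification counts in the standard way. A minimal sketch (the function name and example counts are illustrative, not from the paper):

```python
def spam_recall_precision(tp, fn, fp):
    # tp: spam correctly flagged as spam
    # fn: spam missed (classified as ham)
    # fp: ham wrongly flagged as spam
    recall = tp / (tp + fn)       # fraction of all spam that was caught
    precision = tp / (tp + fp)    # fraction of flagged mail that was spam
    return recall, precision

# e.g. 97 of 100 spam caught, with no ham wrongly flagged:
r, p = spam_recall_precision(tp=97, fn=3, fp=0)
print(r, p)   # 0.97 1.0
```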


Spam:Ham Ratio    th     Spam Recall  Spam Precision
1:1               0.90    97.00%      100.00%
4:6               0.90    98.25%       99.75%
1:5               0.90    97.22%      100.00%
1:1               1.00    99.25%       99.00%
4:6               1.00    99.75%       98.76%
1:5               1.00    98.89%       98.89%
1:1               0.96    99.00%      100.00%
4:6               0.96    99.25%       99.50%
1:5               0.92    97.78%      100.00%

Table 8: Results from ST classification with φ(p) = root(p), match normalised and tree unnormalised.

7 Conclusion

Clearly, the suffix tree universally outperforms the naive Bayesian filter, which is to be expected as the suffix tree makes a much more detailed analysis of the classes by diverging from the 'bag-of-words' paradigm underlying the naive Bayes approach.

We conclude that the choice of significance function is the least important factor in the success of the ST because all of them performed acceptably well. Different functions will perform better on different data sets, but the root function performs marginally more consistently across all data sets.

The threshold was found to be a very important factor in the success of the filter. So much so, that the differences in the performances of particular configurations of the filter were often attributable more to differences in their corresponding optimal thresholds than to the configurations themselves.

Match normalisation was found to be the most effective method of normalisation and was able to improve the performance of all significance functions. In particular, it was able to improve the success of the filter at all threshold values and was to some extent able to decrease the variation in performance across different mixes of spam to ham. Other methods of normalisation were not always so effective, with some of them making things drastically worse. However, normalisation functions based on frequencies in the class or on the density of the tree could be effective under certain conditions.

Unfortunately, even once the most effective configuration had been chosen, not all the behaviour of the ST was as expected. Whereas its character level processing handles the BKS data very well – even better than it handles the supposedly easier LS data – it struggles with the hard ham from SAeh, which we expected to contain exactly the sort of examples which would yield to character level analysis. To understand the relative performance of the ST approach on the email


data sets SAeh and BKS-SAeh, we would need to take a close look at the two contributing data sets, something we don't have space for in this paper.

The NB filter, on the other hand, performs pretty much as expected: well on the easier LS and SAe data sets, and worse on the SAeh and BKS-SAeh data sets. We also found that the spam precision error (SPE) and ham precision error (HPE) curves created by varying the threshold were in all cases steeper for the NB approach than for the ST approach, indicating that the former always performs relatively worse at non-optimal thresholds, thereby making it more difficult to minimise one error without a significant cost in terms of the other.

We can conclude then that there does seem to be a clear advantage in terms of accuracy in using

the suffix tree to filter emails, but this must be balanced against its higher computational demands.

In this paper, we have given little attention to assessing this factor – which is of importance when

considering the development of the method into a viable email filtering application – but this

would obviously have to be one of the next steps.

In the case of both filters, it is clear that discovering the optimal threshold – if it were possible – is a good way of improving performance. It may be possible to do this during an additional training phase in which we use some proportion of the training examples to test the filter and adjust the threshold up or down depending on the outcome of each test. Of course, the threshold may be continuously changing, but this could be handled to some extent dynamically during the actual use of the filter by continually adjusting it in the light of any mistakes made. This would certainly be another possible line of investigation.
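The dynamic adjustment suggested above can be sketched as a simple feedback rule. This is our own illustrative scheme, not something evaluated in the paper: the class name, the fixed step size, and the score-above-threshold decision rule are all assumptions:

```python
class AdaptiveThreshold:
    # Treat scores above the threshold as spam; each confirmed
    # mistake nudges the threshold by a fixed step.
    def __init__(self, th=1.0, step=0.01):
        self.th = th
        self.step = step

    def classify(self, score):
        return "spam" if score > self.th else "ham"

    def feedback(self, score, true_label):
        predicted = self.classify(score)
        if predicted == "spam" and true_label == "ham":
            self.th += self.step    # false positive: be more cautious
        elif predicted == "ham" and true_label == "spam":
            self.th -= self.step    # false negative: be more aggressive

f = AdaptiveThreshold()
f.feedback(score=1.02, true_label="ham")   # a reported false positive
print(round(f.th, 2))   # 1.01
```

A real deployment would likely use per-error step sizes, since false positives (lost ham) are costlier than false negatives.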

Finally, email filtering is not the only application for the suffix tree method. It could be applied

to the classification of other kinds of documents, and we intend to investigate this in future work.

References

[1] K. Aas and L. Eikvil. Text categorisation: A survey. Technical report, Norwegian Computing Center. Available online: citeseer.ist.psu.edu/aas99text.html, June 1999.

[2] I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, G. Paliouras, and C.D. Spyropoulos. An evaluation of naive Bayesian anti-spam filtering. In G. Potamias, V. Moustakis, and M. van Someren, editors, Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning (ECML 2000), pages 9–17, Barcelona, Spain, 2000.

[3] Gill Bejerano and Golan Yona. Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics, 17(1):23–43, 2001.

[4] S. de Freitas and M. Levene. Spam on the internet: Is it here to stay or can it be eradicated? JISC Technology and Standards Watch Reports, (04-01), 2004.

[5] Robert Giegerich and Stefan Kurtz. From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction. Algorithmica, 19(3):331–353, 1997.

[6] John Graham-Cumming. Spammers compendium. Webpage (last accessed October 20, 2004): http://www.jgc.org/tsc/, 2004.

[7] D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.

[8] Stefan Kurtz. Reducing the space requirement of suffix trees. Software Practice and Experience, 29(13):1149–1171, 1999.

[9] David D. Lewis. Naive (Bayes) at forty: The independence assumption in information retrieval. In Claire Nedellec and Celine Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pages 4–15, Chemnitz, DE, 1998. Springer Verlag, Heidelberg, DE.

[10] A. Lloyd. Suffix trees. Webpage (last accessed October 20, 2004): http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Suffix/, 2000.

[11] Bingwen Lu and Ting Chen. A suffix tree approach to the interpretation of tandem mass spectra: applications to peptides of non-specific digestion and post-translational modifications. Bioinformatics, 19(Suppl. 2):ii113–ii121, 2003.

[12] C. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

[13] T.A. Meyer and B. Whateley. SpamBayes: Effective open-source, Bayesian based, email classification system. In Proceedings of the First Conference on Email and Anti-Spam (CEAS), Mountain View, CA, July 2004. Available online: http://www.ceas.cc/papers-2004/136.pdf.

[14] Eirinaios Michelakis, Ion Androutsopoulos, Georgios Paliouras, George Sakkis, and Panagiotis Stamatopoulos. Filtron: A learning-based anti-spam filter. In Proceedings of the First Conference on Email and Anti-Spam (CEAS), Mountain View, CA, July 2004. Available online: http://www.ceas.cc/papers-2004/142.pdf.

[15] M. F. Porter. An algorithm for suffix stripping. In Readings in Information Retrieval, pages 313–316. Morgan Kaufmann Publishers Inc., 1997.

[16] Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz. A Bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin, 1998. AAAI Technical Report WS-98-05. Available online: http://citeseer.ist.psu.edu/sahami98bayesian.html.

[17] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.

[18] E. Ukkonen. Constructing suffix trees on-line in linear time. Algorithms, Software, Architecture: Information Processing, 1(92):484–92, 1992.

[19] UNSPAM. Spam numbers and statistics. Webpage (last accessed October 20, 2004): http://www.unspam.com/fight_spam/information/spamstats.html, 2004.

[20] S. M. Weiss, N. Indurkhya, T. Zhang, and F. J. Damerau. Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer, 2005.

[21] Gregory L. Wittel and S. Felix Wu. On attacking statistical spam filters. In Proceedings of the First Conference on Email and Anti-Spam (CEAS), Mountain View, CA, July 2004. Available online: http://www.ceas.cc/papers-2004/170.pdf.
