A Suffix Tree Approach to Text Classification Applied to Email Filtering

Rajesh Pampapathi, Boris Mirkin, Mark Levene

School of Computer Science and Information Systems, Birkbeck College, University of London


Page 1

A Suffix Tree Approach to Text Classification Applied to Email Filtering

Rajesh Pampapathi, Boris Mirkin, Mark Levene

School of Computer Science and Information Systems, Birkbeck College, University of London

Page 2

Introduction – Outline

- Motivation: examples of spam
- Suffix tree construction
- Document scoring and classification
- Experiments and results
- Conclusion

Page 3

Buy cheap medications online, no prescription needed.
We have Viagra, Pherentermine, Levitra, Soma, Ambien, Tramadol and many more products.
No embarrasing trips to the doctor, get it delivered directly to your door.

Experienced reliable service.
Most trusted name brands.

For your solution click here: http://www.webrx-doctor.com/?rid=1000

1. Standard spam mail

Page 4

zygotes zoogenous zoometric zygosphene zygotactic zygoid zucchettos zymolysis zoopathy zygophyllaceous zoophytologist zygomaticoauricular zoogeologist zymoid zoophytish zoospores zygomaticotemporal zoogonous zygotenes zoogony zymosis zuza zoomorphs zythum zoonitic zyzzyva zoophobes zygotactic zoogenous zombies zoogrpahy zoneless zoonic zoom zoosporic zoolatrous zoophilous zymotically zymosterol

FreeHYSHKRODMonthQGYIHOCSupply.IHJBUMDSTIPLIBJTJUBIYYXFN

* GetJIIXOLDViagraPWXJXFDUUTabletsNXZXVRCBX <http://healthygrow.biz/index.php?id=2>

zonally zooidal zoospermia zoning zoonosology zooplankton zoochemical zoogloeal zoological zoologist zooid zoosphere zoochemical

& Safezoonal andNGASXHBPnatural
& TestedQLOLNYQandEAVMGFCapproved

zonelike zoophytes zoroastrians zonular zoogloeic zoris zygophore zoograft zoophiles zonulas zygotic zymograms zygotene zootomical zymes zoodendrium zygomata zoometries zoographist zygophoric zoosporangium zygotes zumatic zygomaticus zorillas zoocurrent zooxanthella zyzzyvas zoophobia zygodactylism zygotenes zoopathological noZFYFEPBmas <http://healthygrow.biz/remove.php>

5. Embedded message (plus word salad)

Page 5

Buy meds online and get it shipped to your door Find out more here <http://www.gowebrx.com/?rid=1001>

a publications website accepted definition. known are can Commons the be definition. Commons UK great public principal work Pre-Budget but an can Majesty's many contains statements statements titles (eg includes have website. health, these Committee Select undertaken described may publications

4. Word salads

Page 6

Creating a Suffix Tree

[Figure: suffix tree rooted at ROOT, built from the strings MEET and FEET. Each node carries one character (M, F, E or T) and its frequency in parentheses.]
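The construction shown on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: every suffix of each training string is inserted into a character-level trie, and a frequency counter is incremented on each node the suffix passes through. The names `Node`, `insert_suffixes` and `build_tree` are hypothetical.

```python
class Node:
    """One tree node: a frequency counter plus children keyed by character."""
    def __init__(self):
        self.freq = 0
        self.children = {}

def insert_suffixes(root, text):
    # Insert every suffix of `text`, incrementing each node that is passed.
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.freq += 1

def build_tree(strings):
    root = Node()
    for s in strings:
        insert_suffixes(root, s)
    return root

tree = build_tree(["MEET", "FEET"])
# Frequencies of the nodes directly below the root.
top_level = {ch: child.freq for ch, child in tree.children.items()}
print(top_level)
```

Inserting all suffixes this way takes time and space quadratic in the length of each string, so a practical version would probably need some bound, for example a maximum depth; that choice is not covered by the slide.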

Page 7

Levels of Information

Characters: the alphabet (and their frequencies) of a class.

Matches: between query strings and a class.
s = nviaXgraU>Tabl$$$ets
t = xv^ia$graTab£££lets
Matches(s, t) = {v, ia, gra, Tab, l, ets, $}
- But what about overlapping matches?

Trees: properties of the class as a whole.
~ size
~ density (complexity)
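One way to arrive at the match set on this slide is a greedy left-to-right scan. The slide does not say how matches are extracted, so the sketch below is an assumption: at each position of s it takes the longest substring that also occurs anywhere in t, then continues after it. The function name `greedy_matches` is hypothetical, and the scan ignores the overlapping-match question raised above.

```python
def greedy_matches(s, t):
    """Scan s left to right; at each position keep the longest substring also found in t."""
    found = set()
    i = 0
    while i < len(s):
        length = 0
        # Grow the candidate while it still occurs somewhere in t.
        while i + length < len(s) and s[i:i + length + 1] in t:
            length += 1
        if length:
            found.add(s[i:i + length])
            i += length
        else:
            i += 1  # this character does not occur in t, skip it
    return found

s = "nviaXgraU>Tabl$$$ets"
t = "xv^ia$graTab£££lets"
print(sorted(greedy_matches(s, t)))
```

For these two strings the scan yields the same seven matches the slide lists.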

Page 8

Document Similarity Measure

The score for a document, d, is the sum of the scores for each suffix:

$$ \mathrm{SCORE}(d, T) = \frac{1}{\tau} \sum_{i=0}^{n} \mathrm{score}(d(i), T) $$

d(i) is the suffix of d beginning at the ith letter

tau is a tree normalisation coefficient

Page 9

Substring Similarity Measure

Score for match, m = m0m1m2…mn, is score(m):

$$ \mathrm{score}(m) = v(m \mid T) \sum_{t=0}^{n} \Phi[p(m_t)] $$

T is the tree profile of the class.

v(m|T) is a normalisation coefficient based on the properties of T.

p(mt) is the probability of the character, mt, of the match m.

Φ[p] is a significance function.
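The match score and the document score can be sketched together as follows. This is an illustration under assumptions, not the authors' code: p(m_t) is taken to be a node's frequency divided by the total frequency of its siblings, the match m for a suffix is the longest prefix of that suffix found in the tree, and v and tau default to 1 (unnormalised). All function names are hypothetical.

```python
class Node:
    def __init__(self):
        self.freq = 0
        self.children = {}

def build_tree(strings):
    # Character-level trie over all suffixes, with a frequency on every node.
    root = Node()
    for s in strings:
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.children.setdefault(ch, Node())
                node.freq += 1
    return root

def score_match(suffix, root, phi=lambda p: 1.0, v=1.0):
    """score(m) = v * sum_t phi(p(m_t)), m being the longest prefix of `suffix` in the tree."""
    total, node = 0.0, root
    for ch in suffix:
        child = node.children.get(ch)
        if child is None:
            break  # the match ends at the first character missing from the tree
        # Assumption: p(m_t) is this node's frequency relative to its siblings.
        p = child.freq / sum(c.freq for c in node.children.values())
        total += phi(p)
        node = child
    return v * total

def score_document(d, root, phi=lambda p: 1.0, tau=1.0):
    """SCORE(d, T) = (1 / tau) * sum of score(d(i), T) over every suffix d(i) of d."""
    return sum(score_match(d[i:], root, phi) for i in range(len(d))) / tau

tree = build_tree(["MEET", "FEET"])
print(score_document("MEAT", tree))                   # constant significance function
print(score_document("MEAT", tree, phi=lambda p: p))  # linear significance function
```

With the constant function the score simply counts matched characters over all suffixes; swapping in another significance function changes how much each matched character contributes.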

Page 10

Decision Mechanism

$$ \frac{\mathrm{SCORE}(d, T_H)}{\mathrm{SCORE}(d, T_S)} > \text{threshold} \;\Rightarrow\; \text{HAM} $$

$$ \frac{\mathrm{SCORE}(d, T_H)}{\mathrm{SCORE}(d, T_S)} \le \text{threshold} \;\Rightarrow\; \text{SPAM} $$
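A thresholded decision of this kind might look like the sketch below. It assumes the ratio places the ham-tree score in the numerator and the spam-tree score in the denominator, and that a document exactly on the threshold is labelled spam; the extracted slide does not settle either point. The handling of a zero spam score is also an assumption, and `classify` is a hypothetical name.

```python
def classify(score_ham, score_spam, threshold=1.0):
    """Label a document from its scores against the ham tree and the spam tree."""
    if score_spam == 0:
        # Assumption: with no overlap with the spam tree the ratio is treated as unbounded.
        return "HAM" if score_ham > 0 else "SPAM"
    ratio = score_ham / score_spam
    return "HAM" if ratio > threshold else "SPAM"

print(classify(2.0, 1.0))
print(classify(1.0, 2.0))
```

Raising the threshold makes the filter label more mail as spam, which trades false negatives for false positives.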

Page 11

Specifications of Φ[p] (character level)

Constant: 1
Linear: p
Square: p^2
Root: p^0.5
Logit: ln(p) – ln(1-p)
Sigmoid: (1 + exp(-p))^-1

Note: Logit and Sigmoid need to be adjusted to fit in the range [0,1]
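The six candidate functions can be written down directly. The sketch below gives the raw forms only; the slide says the logit and sigmoid are adjusted to fit the range [0,1] but does not say how, so no rescaling is attempted here. The dictionary name `SIGNIFICANCE` is hypothetical.

```python
import math

# Raw significance functions of a character probability p.
SIGNIFICANCE = {
    "constant": lambda p: 1.0,
    "linear":   lambda p: p,
    "square":   lambda p: p ** 2,
    "root":     lambda p: p ** 0.5,
    "logit":    lambda p: math.log(p) - math.log(1 - p),  # undefined at p = 0 and p = 1
    "sigmoid":  lambda p: 1.0 / (1.0 + math.exp(-p)),
}

for name, phi in SIGNIFICANCE.items():
    print(name, round(phi(0.25), 4))
```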

Page 12

Significance function

Page 13

Threshold Variation ~ Significance functions ~

Page 14

Threshold Variation ~ Significance functions ~

Page 15

Match normalisation

Match unnormalised: $v(m \mid T) = 1$

Match permutation normalised: $v(m \mid T) = \dfrac{f(m \mid T)}{\sum_{i \in (m^{*} \mid T)} f(i)}$

Match length normalised: $v(m \mid T) = \dfrac{f(m \mid T)}{\sum_{i \in (m' \mid T)} f(i)}$

m* is the set of all strings formed by permutations of m

m’ is the set of all strings of length equal to length of m
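The permutation and length normalisations can be sketched on a small tree as below. This is an illustration under assumptions: f(m|T) is read as the frequency stored on the node reached by spelling m from the root (0 if m is absent), and a zero denominator is mapped to 0. The names `f`, `freq_at_depth`, `mpn` and `mln` are hypothetical.

```python
from itertools import permutations

class Node:
    def __init__(self):
        self.freq = 0
        self.children = {}

def build_tree(strings):
    root = Node()
    for s in strings:
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.children.setdefault(ch, Node())
                node.freq += 1
    return root

def f(m, root):
    """Frequency of the string m in the tree, or 0 if m is absent."""
    node = root
    for ch in m:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.freq

def freq_at_depth(root, depth):
    """Summed frequency of all nodes exactly `depth` characters below the root."""
    level = [root]
    for _ in range(depth):
        level = [child for node in level for child in node.children.values()]
    return sum(node.freq for node in level)

def mpn(m, root):
    # Permutation normalised: f(m|T) over the summed frequency of every distinct permutation of m.
    denom = sum(f("".join(p), root) for p in set(permutations(m)))
    return f(m, root) / denom if denom else 0.0

def mln(m, root):
    # Length normalised: f(m|T) over the summed frequency of all strings of the same length.
    denom = freq_at_depth(root, len(m))
    return f(m, root) / denom if denom else 0.0

tree = build_tree(["MEET", "FEET"])
print(mpn("ET", tree), mln("ET", tree))
```

Enumerating permutations grows factorially with the length of m, so this direct form is only practical for short matches.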

Page 16

Match normalisation

MUN: match unnormalised; MPN: permutation normalised; MLN: length normalised

Page 17

Threshold Variation ~ match normalisation ~

[Figure panels: Constant significance function, unnormalised; Constant significance function, match normalised]

Page 18

Specifications of tau

Unnormalised: 1
Size(T): The total number of nodes
Density(T): The average number of children of internal nodes
AvFreq(T): Average frequency of nodes
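The three tree statistics can be computed in a single traversal. The sketch below makes two assumptions the slide does not settle: the root is excluded from the node count and from the average frequency, but it is counted as an internal node when computing density. The name `tree_stats` is hypothetical.

```python
class Node:
    def __init__(self):
        self.freq = 0
        self.children = {}

def build_tree(strings):
    root = Node()
    for s in strings:
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.children.setdefault(ch, Node())
                node.freq += 1
    return root

def tree_stats(root):
    """Return (size, density, avfreq) for the tree below `root`."""
    size = freq_total = internal = child_links = 0
    stack = [root]
    while stack:
        node = stack.pop()
        if node is not root:       # assumption: the root is not counted as a node
            size += 1
            freq_total += node.freq
        if node.children:          # assumption: the root counts as an internal node
            internal += 1
            child_links += len(node.children)
        stack.extend(node.children.values())
    density = child_links / internal if internal else 0.0
    avfreq = freq_total / size if size else 0.0
    return size, density, avfreq

size, density, avfreq = tree_stats(build_tree(["MEET", "FEET"]))
print(size, density, avfreq)
```

Any of the three values could then be passed as tau when scoring a document against the tree.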

Page 19

Tree normalisation

Page 20

Androutsopoulos et al. (2000) ~ Ling-Spam Corpus ~

Classifier | Pre-processing | Number of Features | Spam Recall Error | Spam Precision Error
Naïve Bayes (NB) | Lemmatizer + Stop-List | 100 | 17.22% | 0.51%
Suffix Tree (ST) | None | N/A | 2.50% | 0.21%
Naïve Bayes* (NB*) | Lemmatizer + Stop-List | Unlimited | 0.84% | 2.86%

Classifier | Pre-processing | Number of Features | Spam Recall Error | Spam Precision Error
Naïve Bayes (NB) | Lemmatizer + Stop-List | 300 | 36.95% | 0%
Suffix Tree (ST) | None | N/A | 3.96% | 0%
Naïve Bayes* (NB*) | Lemmatizer + Stop-List | Unlimited | 10.42% | 0%

Page 21

~ Ling-BKS Corpus ~

Classifier | Pre-processing | False Positive Rate | False Negative Rate
Suffix Tree (ST) | None | 0% | 0%
Naïve Bayes* (NB*) | Lemmatizer + Stop-List | 0% | 12.25%

~ SpamAssassin Corpus ~

Classifier | Pre-processing | False Positive Rate | False Negative Rate
Suffix Tree (ST) | None | 3.50% | 3.25%
Naïve Bayes* (NB*) | Lemmatizer + Stop-List | 10.50% | 1.50%

Page 22

Conclusions

Good overall classifier
- improvement on naïve Bayes
- but there’s still room for improvement

Can one method ever maintain 100% accuracy?

Extending the classifier

Applications to other domains
- web page classification

Page 23

Future Work - ODP

Page 24

Computational Performance

Data Set | Training (s) | Av. Spam (ms) | Av. Ham (ms) | Av. Peak Mem.
LS-FULL (7.40MB) | 63 | 843 | 659 | 765MB
LS-11 (1.48MB) | 36 | 221 | 206 | 259MB
SAeh-11 (5.16MB) | 155 | 504 | 2528 | 544MB
BKS-LS-11 (1.12MB) | 41 | 161 | 222 | 345MB

Page 25

Experimental Data Sets

Ling-Spam (LS)
- Spam (481) collected by Androutsopoulos et al.
- Ham (2412) from online linguists’ bulletin board

Spam Assassin
- Easy (SAe)
- Hard (SAh)
- Spam (1876) and ham (4176) examples donated

BBK
- Spam (652) collected by Birkbeck

Page 26

Androutsopoulos et al. (2000) ~ Ling-Spam Corpus ~

Classifier Configuration | Threshold | No. of Attrib. | Spam Recall | Spam Precision
Bare | 0.5 | 50 | 81.10% | 96.85%
Stop-List | 0.5 | 50 | 82.35% | 97.13%
Lemmatizer | 0.5 | 100 | 82.35% | 99.02%
Lemmatizer + Stop-List | 0.5 | 100 | 82.78% | 99.49%
Bare | 0.9 | 200 | 76.94% | 99.46%
Stop-List | 0.9 | 200 | 76.11% | 99.47%
Lemmatizer | 0.9 | 100 | 77.57% | 99.45%
Lemmatizer + Stop-List | 0.9 | 100 | 78.41% | 99.47%
Bare | 0.999 | 200 | 73.82% | 99.43%
Stop-List | 0.999 | 200 | 73.40% | 99.43%
Lemmatizer | 0.999 | 300 | 63.67% | 100.00%
Lemmatizer + Stop-List | 0.999 | 300 | 63.05% | 100.00%

Page 27

Androutsopoulos et al. (2000) ~ Ling-Spam Corpus ~

Classifier | Configuration | Spam Recall Error | Spam Precision Error
Naïve Bayes (NB) | Lemmatizer + Stop-List | 17.22% | 0.51%
Suffix Tree (ST) | N/A | 2.5% | 0.21%
Naïve Bayes* (NB*) | Lemmatizer + Stop-List | 0.84% | 2.86%

Classifier | Configuration | Spam Recall Error | Spam Precision Error
Naïve Bayes (NB) | Lemmatizer + Stop-List | 36.95% | 0%
Suffix Tree (ST) | N/A | 3.96% | 0%
Naïve Bayes* (NB*) | Lemmatizer + Stop-List | 10.42% | 0%

Page 28

~ SpamAssassin Corpus ~

Classifier | Configuration | Spam Recall | Spam Precision
Naïve Bayes (NB) | Lemmatizer + Stop-List | 82.78% | 99.49%
Suffix Tree (ST) | N/A | 97.50% | 99.79%
Naïve Bayes* (NB*) | Lemmatizer + Stop-List | 99.16% | 97.14%

Page 32

Vector Space Model

“What then?” sang Plato’s ghost, “What then?”

W. B. Yeats

[Figure: term-count vector for the quotation over the terms what, host, plate, Plato, ghost, then, sang, book; extracted values: 1 0 10 1 2 20]

Word Probability: P(w = ‘what’) = 50/1000 = 0.05

Page 33

Creating Profiles

Mark

Page 34

Profiles

Mark Levene: data, databases, information, search, engines

Mike Hu: data, intelligence, criminal, computational, police

Page 35

Classification

Boris Mirkin

Mark Levene

Mike Hu

SBM SML SMH

Page 36

Naïve Bayes (similarity measure)

For a document d = {d1d2d3 … dm} and set of classes c = {c1, c2 ... cJ}:

$$ P(c_j \mid d) \propto P(c_j) \prod_{i=1}^{m} P(d_i \mid c_j) \qquad (1) $$

Where:

$$ \tilde{P}(c_j) = \frac{N_j}{N} \qquad (2) $$

$$ \tilde{P}(d_i \mid c_j) = \frac{1 + n_{ij}}{M + \sum_{k=1}^{M} n_{kj}} \qquad (3) $$
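The three naïve Bayes estimates can be sketched on a toy corpus as follows. The slide does not define its symbols, so the readings here are assumptions: N_j is the number of training documents in class c_j, N the total number of documents, n_ij the count of word d_i in class c_j, and M the vocabulary size. Logarithms are used to avoid underflow; the function names and the training data are hypothetical.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (label, list_of_words). Returns priors, per-class word counts, vocabulary."""
    class_docs = Counter(label for label, _ in docs)
    word_counts = {label: Counter() for label in class_docs}
    for label, words in docs:
        word_counts[label].update(words)
    vocab = {w for _, words in docs for w in words}
    priors = {c: n / len(docs) for c, n in class_docs.items()}  # estimate (2)
    return priors, word_counts, vocab

def log_posterior(words, c, priors, word_counts, vocab):
    """Log of the right-hand side of (1) for class c."""
    M = len(vocab)
    total = sum(word_counts[c].values())  # sum over k of n_kj
    lp = math.log(priors[c])
    for w in words:
        lp += math.log((1 + word_counts[c][w]) / (M + total))  # estimate (3)
    return lp

def classify(words, priors, word_counts, vocab):
    return max(priors, key=lambda c: log_posterior(words, c, priors, word_counts, vocab))

# Hypothetical toy training set.
docs = [
    ("spam", ["buy", "cheap", "meds"]),
    ("spam", ["cheap", "meds", "online"]),
    ("ham", ["meeting", "notes", "online"]),
    ("ham", ["project", "meeting", "today"]),
]
model = train(docs)
print(classify(["cheap", "meds"], *model))
print(classify(["meeting", "today"], *model))
```

The added 1 in the numerator of (3) keeps a word unseen in a class from forcing that class's product to zero.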

Page 37

Criticisms

Pre-processing:
- Stop-word removal
- Word stemming/lemmatisation
- Punctuation and formatting

Smallest unit of consideration is a word.

Classes (and documents) are bags of words, i.e. each word is independent of all others.

Page 38

Word Dependencies

Boris Mirkin: data, intelligence, clustering, computational, means

Mike Hu: data, intelligence, criminal, computational, means

Page 39

Word Inflections

Intellig- OR intelligent

Intelligent

Intelligence

Intelligentsia

Intelligible

Page 40

Success measures

Recall is the proportion of correctly classified examples of a class.

If SR is spam recall, then (1-SR) gives the proportion of false negatives.

Precision is the proportion assigned to a class which are true members of that class. It is a measure of the number of true positives.

If SP is spam precision, then (1 – SP) would give the proportion of false positives.

$$ SR = \frac{\#(S \rightarrow S)}{\#(S \rightarrow S) + \#(S \rightarrow H)} $$

$$ SP = \frac{\#(S \rightarrow S)}{\#(S \rightarrow S) + \#(H \rightarrow S)} $$

Here #(X → Y) is the number of messages of class X (S for spam, H for ham) that were classified as Y.
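Spam recall and spam precision follow directly from three counts. The sketch below uses hypothetical counts chosen only for illustration; `ss` is spam classified as spam, `sh` is spam classified as ham, and `hs` is ham classified as spam.

```python
def spam_recall(ss, sh):
    """SR: spam classified as spam, over all true spam."""
    return ss / (ss + sh)

def spam_precision(ss, hs):
    """SP: spam classified as spam, over everything classified as spam."""
    return ss / (ss + hs)

ss, sh, hs = 90, 10, 5  # hypothetical counts, not results from the experiments
print(spam_recall(ss, sh), spam_precision(ss, hs))
```

The complements 1 - SR and 1 - SP are the error figures reported in the results tables as spam recall error and spam precision error.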

Page 41

Success measures

True Positive Rate (TPR) is the proportion of correctly classified examples of the ‘positive’ class.

Spam is typically taken as the positive class, so TPR is then the number of spam classified as spam over the total number of spam.

False Positive Rate (FPR) is the proportion of the ‘negative’ class erroneously assigned to the ‘positive’ class.

Ham is typically taken as the negative class, so FPR is then the number of ham classified as spam over the total number of ham.

$$ TPR = \frac{\#(\text{Spam} \rightarrow \text{Spam})}{\text{TotalSpam}} $$

$$ FPR = \frac{\#(\text{Ham} \rightarrow \text{Spam})}{\text{TotalHam}} $$
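Both rates can be computed from parallel lists of true and predicted labels. The sketch below takes spam as the positive class, as the slide suggests; the labels are made up for illustration and `rates` is a hypothetical name.

```python
def rates(labels, predictions, positive="spam"):
    """Return (TPR, FPR) from parallel lists of true labels and predicted labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    total_pos = sum(1 for y in labels if y == positive)
    total_neg = len(labels) - total_pos
    tpr = tp / total_pos if total_pos else 0.0
    fpr = fp / total_neg if total_neg else 0.0
    return tpr, fpr

# Hypothetical labels, not data from the experiments.
labels      = ["spam", "spam", "spam", "ham", "ham", "ham", "ham"]
predictions = ["spam", "spam", "ham", "ham", "spam", "ham", "ham"]
tpr, fpr = rates(labels, predictions)
print(tpr, fpr)
```

Sweeping the decision threshold and recording the (FPR, TPR) pair at each setting is one way to produce threshold-variation curves.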

Page 42

Classifier Structure

Training Data

Profiling Method

Profile Representation

Similarity/Comparison Measure

Decision Mechanism or Classification Criterion

Decision

Spam Ham

Spam Ham

?

Page 43

Classification using a suffix tree

Method of profiling is construction of the tree (no pre-processing, no post-processing)

The tree is a profile of the class.

Similarity measure?

Decision mechanism?

Page 44

Threshold Variation ~ match normalisation ~

[Figure panels: Constant significance function, unnormalised; Constant significance function, match normalised]

SPE = spam precision error; HPE = ham precision error

Page 45

Threshold Variation ~ Significance functions ~

SPE = spam precision error; HPE = ham precision error

[Figure panels: Root function, no normalisation; Logit function, no normalisation]

Page 46

Threshold Variation

Constant significance function (unnormalised)

SPE = spam precision error; HPE = ham precision error