
Goal: Learn to automatically
- File e-mails into folders
- Filter spam e-mail

Motivation

Information overload: we spend more and more time filtering e-mails and organizing them into folders to make later retrieval easier

Weaknesses of the programmable automatic filtering provided by modern e-mail software (rules that file mail into folders or flag spam based on keywords):

- Most users do not create such rules: they find the software difficult to use or simply avoid customizing it

- Manually constructed rules are hard to keep robust, as users are constantly creating, deleting and reorganizing their folders

- The nature of the e-mails within a folder may well drift over time, and the characteristics of spam (e.g. topics, frequent terms) also change over time, so the rules must be constantly tuned by the user, which is time-consuming and error-prone

LINGER

Based on Text Categorization

Bag-of-words representation – all unique words in the entire training corpus are identified

Feature selection chooses the most important words and reduces dimensionality – Information Gain (IG) or Variance (V)
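The poster does not give the selection formulas; as a sketch, information gain for a binary word-presence feature can be computed as IG(w) = H(C) − H(C|w), where H is Shannon entropy over the class labels (the function names and toy data below are illustrative, not from the poster):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, word):
    """IG of a binary word-presence feature: H(C) - H(C | word present/absent).
    docs are sets of words, labels the matching class labels."""
    with_w = [y for d, y in zip(docs, labels) if word in d]
    without_w = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    cond = sum(len(part) / n * entropy(part)
               for part in (with_w, without_w) if part)
    return entropy(labels) - cond

# Toy corpus: "viagra" perfectly predicts the class, "free" does not
docs = [{"meeting", "agenda"}, {"viagra", "free"},
        {"meeting", "free"}, {"viagra"}]
labels = ["ham", "spam", "ham", "spam"]
print(information_gain(docs, labels, "viagra"))  # 1.0
print(information_gain(docs, labels, "free"))    # 0.0
```

Words are then ranked by this score and only the top-scoring ones kept.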

Feature representation – normalized weighting for every word, representing its importance in the document

- Weightings: binary, term frequency, and term frequency–inverse document frequency (tf-idf)
- Normalization at 3 levels (e-mail, mailbox, corpus)
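A minimal sketch of the tf-idf weighting, normalized here at the e-mail level (the poster does not specify its exact idf variant or normalization formula, so this is one common choice):

```python
import math

def tfidf_vectors(bags):
    """tf-idf weights for each e-mail; bags are lists of tokens.
    idf(w) = log(N / df(w)); each e-mail's weights are then divided by
    the largest weight in that e-mail (e-mail-level normalization)."""
    n = len(bags)
    df = {}
    for bag in bags:
        for w in set(bag):
            df[w] = df.get(w, 0) + 1
    vectors = []
    for bag in bags:
        tf = {}
        for w in bag:
            tf[w] = tf.get(w, 0) + 1
        vec = {w: c * math.log(n / df[w]) for w, c in tf.items()}
        top = max(vec.values(), default=0.0) or 1.0  # avoid division by zero
        vectors.append({w: v / top for w, v in vec.items()})
    return vectors

bags = [["meeting", "agenda", "meeting"], ["free", "offer"], ["meeting", "free"]]
vectors = tfidf_vectors(bags)
```

Mailbox- and corpus-level normalization would instead divide by the largest weight across a whole folder or across all e-mails.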

Classifier – neural network (NN), why?

- They require considerable time for parameter selection and training (−), but can achieve very accurate results and have been successfully applied in many real-world applications (+)

- NN trained with backpropagation; 1 output per class, 1 hidden layer of 20–40 neurons; early stopping based on a validation set or a maximum number of epochs (10,000)
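The training setup can be sketched with a minimal numpy network: one tanh hidden layer, one softmax output per class, plain batch backpropagation on cross-entropy. This is an illustration under assumed hyperparameters, not the authors' implementation, and early stopping on a validation set is omitted for brevity:

```python
import numpy as np

def train_nn(X, Y, hidden=20, lr=0.5, epochs=2000, seed=0):
    """One-hidden-layer net, one softmax output per class, trained with
    batch backpropagation on cross-entropy (illustrative hyperparameters)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = Y.shape[1]
    W1 = rng.normal(0.0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, k)); b2 = np.zeros(k)
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)                    # hidden layer
        z = h @ W2 + b2
        p = np.exp(z - z.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)           # softmax outputs
        dz = (p - Y) / n                            # dE/dz for cross-entropy
        dh = (dz @ W2.T) * (1.0 - h ** 2)           # backprop through tanh
        W2 -= lr * (h.T @ dz); b2 -= lr * dz.sum(axis=0)
        W1 -= lr * (X.T @ dh); b1 -= lr * dh.sum(axis=0)
    return W1, b1, W2, b2

def predict(X, W1, b1, W2, b2):
    return np.argmax(np.tanh(X @ W1 + b1) @ W2 + b2, axis=1)

# Tiny two-class toy problem (two well-separated clusters)
X = np.array([[0, 0], [0.5, 0], [0, 0.5], [3, 3], [3.5, 3], [3, 3.5]])
Y = np.eye(2)[[0, 0, 0, 1, 1, 1]]                   # one-hot targets
params = train_nn(X, Y)
```

In the poster's setting X would hold the normalized bag-of-words vectors and each output neuron would correspond to one folder (or to spam/legitimate).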

Pre-processing for word extraction
E-mail fields used: body, sender (From, Reply-to), recipient (To, CC, Bcc) and Subject (attachments seen as part of the body)
- These fields are treated equally and a single bag-of-words representation is created for each e-mail
- No stemming or stop-word removal was applied
- Words that appear only once in a corpus are discarded
- Words longer than 20 characters are removed from the body
- # unique words in a corpus reduced from 9000 to 1000
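The steps above can be sketched as follows; the tokenizer regex and the treatment of each e-mail as a single pooled text string are simplifying assumptions, since the poster does not specify them:

```python
import re
from collections import Counter

def preprocess(emails):
    """Build per-e-mail bags of words: tokenize the pooled field text,
    drop words longer than 20 characters, and discard words that occur
    only once in the whole corpus. No stemming or stop-word removal."""
    bags = [re.findall(r"[A-Za-z0-9'-]+", text.lower()) for text in emails]
    bags = [[w for w in bag if len(w) <= 20] for bag in bags]
    corpus_freq = Counter(w for bag in bags for w in bag)
    return [Counter(w for w in bag if corpus_freq[w] > 1) for bag in bags]

emails = ["Meeting agenda for the project meeting",
          "Free offer! Claim your free prize",
          "Project status and agenda"]
bags = preprocess(emails)
```

Words such as "offer" or "status" above occur only once in the corpus, so they are discarded from every bag.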

Corpora

Filing into folders Spam e-mail filtering

4 versions of PU1 and LingSpam depending on whether stemming and stop word list were used (bare, lemm, stop and lemm_stop)

LINGER – A Smart Personal Assistant for E-mail Classification

James Clark, Irena Koprinska, and Josiah Poon
School of Information Technologies, University of Sydney,
Sydney, Australia, e-mail: {jclark, irena, josiah}@it.usyd.edu.au

Results

Performance Measures: accuracy (A), recall (R), precision (P) and the F1 measure

Stratified 10-fold cross validation

Filing Into Folders

Overall Performance

The simpler feature selector V is more effective than IG
U2 and U4 were harder to classify than U1, U3 and U5:

- different classification styles:

U1, U3, U5 - based on the topic and sender, U2 - on the action performed (e.g. Read&Keep), U4 - on the topic, sender, action performed and also when e-mails needed to be acted upon (e.g. ThisWeek)

- large ratio of the number of folders to the number of e-mails for U2 and U4

Comparison with other classifiers

Effect of Normalization and Weighting

Accuracy [%] for various normalization (e – e-mail, m – mailbox, g – global) and weighting (freq. – frequency, tf-idf and boolean) schemes

Best results: mailbox level normalization, tf-idf and frequency weighting

Per-folder confusion matrix for folder ci:

                          # assigned to folder ci   # not assigned to folder ci
  # from folder ci                  tp                          fn
  # not from folder ci              fp                          tn

  A = (tp + tn) / (tp + fn + fp + tn)
  P = tp / (tp + fp),  R = tp / (tp + fn),  F1 = 2PR / (P + R)
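These measures follow directly from one folder's confusion-matrix counts; a minimal sketch (function name is illustrative):

```python
def folder_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F1 from one folder's counts,
    guarding against empty denominators."""
    a = (tp + tn) / (tp + fn + fp + tn)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return a, p, r, f1

a, p, r, f1 = folder_metrics(8, 2, 2, 88)
```

Per-user figures are then obtained by averaging over folders and over the 10 cross-validation folds.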

user  feature selection   A [%]   R [%]   P [%]   F1 [%]   epochs
U1    V                   86.23   67.28   71.02   69.10      2150
      IG                  88.08   77.77   81.12   79.44      7960
U2    V                   68.54   41.27   45.11   43.10       702
      IG                  61.46   37.07   41.94   39.36      1738
U3    V                   92.34   76.18   77.45   76.81      3898
      IG                  74.55   45.02   47.41   46.18      5620
U4    V                   79.48   40.39   39.46   39.92      4616
      IG                  65.23   24.97   22.51   23.68      4250
U5    V                   83.40   81.66   84.75   83.18      4220
      IG                  70.16   64.79   71.32   67.90      9032

Spam Filtering

Cost-Sensitive Performance Measures
Blocking a legitimate message is λ times more costly than not blocking a spam message
Weighted accuracy (WA): when a legitimate e-mail is misclassified/correctly classified, this counts as λ errors/successes
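Treating spam as the positive class, this definition can be sketched as follows (the formula shape is the usual cost-sensitive one from the anti-spam literature; the poster states only the verbal definition):

```python
def weighted_accuracy(tp, fn, fp, tn, lam=9):
    """Weighted accuracy: each legitimate e-mail counts lam times, so a
    blocked legitimate message (fp) costs lam errors and a kept one (tn)
    earns lam successes. lam=1 reduces to plain accuracy."""
    n_spam = tp + fn      # spam e-mails (tp = spam correctly blocked)
    n_legit = fp + tn     # legitimate e-mails
    return (lam * tn + tp) / (lam * n_legit + n_spam)
```

With lam=999 a single blocked legitimate message dominates the score, which is why that scenario demands a near-perfect filter.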

Overall Performance

Performance on spam filtering for lemm corpora

3 scenarios: λ=1, no cost (flagging spam e-mail); λ=9, moderately accurate filter (notifying senders about blocked e-mails); and λ=999, highly accurate filter (completely automatic scenario)
LingerIG – perfect results on both PU1 and LingSpam for all λ; LingerV outperformed only by stumps and boosted trees

Effect of Stemming and Stop Word List – they do not help

Performance on the 4 different versions of LingSpam

Anti-Spam Filter Portability Across Corpora

a) and b) – low SP; many fp (non spam as spam)

- Reason: different nature of legitimate e-mail in LingSpam (linguistics related) and U5Spam (more diverse)

- Features selected from LingSpam are too specific, not a good predictor for U5Spam (a)
b) – low SR as well; many fn (spam classified as non-spam)

- Reason: U5Spam is considerably smaller than LingSpam

Typical confusion matrices a) and b)

c) and d) – good results

- Feature selection is not perfect, but the NN is able to recover through training
More extensive experiments with diverse, non-topic-specific corpora are needed to determine the portability of anti-spam filters across different users

learner       U1      U2      U3      U4
kNN1          50.60   40.60   73.00   34.00
kNN5          71.90   24.70   59.60   48.40
kNN30         71.50   47.50   46.80   36.40
SVM1          79.40   40.60   80.70   58.50
SVM2          75.80   37.90   75.60   51.60
DT (C4.5)     71.30   43.40   78.20   57.30
NB            69.70   38.70   47.10   24.10
Perceptron    72.00   39.70   77.30   59.30
Widrow-Hoff   80.60   48.50   82.40   60.90

Filing into folders:
corpus   # e-mails   # folders
U1          545          7
U2          423          6
U3          888         11
U4          926         19
U5          982          6

Spam e-mail filtering:
corpus     # e-mails   # spam   # legit.
PU1          1099        481       618
LingSpam     2893        481      2412
U5Spam        282         82       200

[Bar chart: accuracy [%] (0–100) for User1–User5 under each normalization + weighting scheme: e+freq., e+tf-idf, m+freq., m+tf-idf, g+freq., g+tf-idf, boolean]

                       λ=1                               λ=9        λ=999
corpus    learner     A [%]   SR [%]  SP [%]  SF1 [%]   WA [%]     WA [%]
LingSpam  LingerV     98.20   93.56   95.62   94.58     99.01      99.13
          LingerIG    100     100     100     100       100        100
          NB          96.93   82.40   99.00   89.94     99.43      99.99
          k-NN        N/Av    88.60   97.40   92.79     99.40      N/Av
          Stacking    N/Av    91.70   96.50   93.93     99.46      N/Av
          Stumps      N/Av    97.92   98.33   98.12     99.76      99.95
          TrBoost     N/Av    97.30   98.53   97.91     99.77      99.99
PU1       LingerV     93.45   88.36   96.46   92.23     96.69      97.41
          LingerIG    100     100     100     100       100        100
          NB          N/Av    83.98   95.11   89.19     96.38      99.47
          DT          N/Av    89.81   88.71   89.25     N/Av       N/Av
          Stumps      N/Av    96.47   97.48   96.97     98.58      99.66
          TrBoost     N/Av    96.88   98.52   97.69     99.14      99.98

[Bar chart: A, SR, SP and SF1 [%] (0–100) on the four LingSpam versions: bare, lemm, stop, lemm_stop]

              features   train      test       A     SR    SP    SF1
Experiment 1
  a           LingSpam   LingSpam   U5Spam     36.8  99.0  31.4  47.7
  b           U5Spam     U5Spam     LingSpam   60.6  29.5  15.1  20.0
Experiment 2
  c           LingSpam   U5Spam     U5Spam     87.9  62.2  94.4  75.0
  d           U5Spam     LingSpam   LingSpam   98.8  94.6  98.1  96.3

a)                 # assigned as
                   spam    not spam
   spam              81         1
   not spam         177        23

b)                 # assigned as
                   spam    not spam
   spam             159       322
   not spam         832      1580