LINGER – A Smart Personal Assistant for E-mail Classification
James Clark, Irena Koprinska, and Josiah Poon
School of Information Technologies, University of Sydney, Sydney, Australia
e-mail: {jclark, irena, josiah}@it.usyd.edu.au
Goal: Learn to automatically
File e-mails into folders
Filter spam e-mail
Motivation
Information overload – we spend more and more time filtering e-mails and organizing them into folders to facilitate retrieval when necessary
Weaknesses of the programmable automatic filtering provided by the modern e-mail software (rules to organize mail into folders or spam mail filtering based on keywords):
- Most users do not create such rules as they find it difficult to use the software or simply avoid customizing it
- Manually constructing robust rules is difficult as users are constantly creating, deleting, reorganizing their folders
- The nature of the e-mails within a folder may well drift over time, and the characteristics of spam e-mail (e.g. topics, frequent terms) also change over time => the rules must be constantly tuned by the user, which is time-consuming and error-prone
LINGER
Based on Text Categorization
Bag-of-words representation – all unique words in the entire training corpus are identified
Feature selection chooses the most important words and reduces dimensionality – Inf. Gain (IG), Variance (V)
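The IG selector can be sketched as follows. This is a generic information-gain computation over binary word-presence features, not LINGER's actual code; the variance selector V would score words by a different statistic.

```python
# Sketch of information-gain feature scoring: IG(w) = H(C) - H(C|w),
# where presence/absence of word w splits the corpus. Illustrative only.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, word):
    # Split the corpus by presence/absence of the word, then compare entropies.
    with_w = [l for d, l in zip(docs, labels) if word in d]
    without_w = [l for d, l in zip(docs, labels) if word not in d]
    n = len(labels)
    cond = sum(len(part) / n * entropy(part)
               for part in (with_w, without_w) if part)
    return entropy(labels) - cond

# Tiny toy corpus: each document is the set of words it contains.
docs = [{"meeting", "agenda"}, {"free", "offer"},
        {"agenda", "minutes"}, {"free", "prize"}]
labels = ["ham", "spam", "ham", "spam"]
scores = {w: information_gain(docs, labels, w)
          for w in {"agenda", "free", "meeting"}}
# "agenda" and "free" each perfectly separate ham from spam here, so IG = 1.0
```

Keeping only the highest-scoring words is what reduces the dimensionality of the bag-of-words vectors.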
Feature representation – normalized weighting for every word, representing its importance in the document
- Weightings: binary, term frequency, term frequency-inverse document frequency (tf-idf)
- Normalization at 3 levels (e-mail, mailbox, corpus)
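The tf-idf weighting with e-mail-level normalization can be sketched as below; the log base and the exact normalization scheme are illustrative assumptions, not LINGER's published formulas.

```python
# Sketch of tf-idf weighting over bag-of-words documents, normalizing each
# e-mail's vector to unit length (e-mail-level normalization). Illustrative.
import math
from collections import Counter

def tfidf_vectors(docs):
    n = len(docs)
    # Document frequency of each word across the corpus
    df = Counter(w for d in docs for w in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vec = {w: tf[w] * math.log(n / df[w]) for w in tf}
        # E-mail-level normalization: scale the vector to unit length
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({w: v / norm for w, v in vec.items()})
    return vectors

docs = [["free", "offer", "free"], ["meeting", "agenda"]]
vecs = tfidf_vectors(docs)
```

Mailbox-level normalization (the best performer in the results below) would instead scale using statistics pooled over all e-mails in a mailbox.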
Classifier – neural network (NN), why?
- Require considerable time for parameter selection and training (-), but can achieve very accurate results and have been successfully applied in many real-world applications (+)
- NN trained with backpropagation, 1 output for each class, 1 hidden layer of 20-40 neurons; early stopping based on validation set or a max. # epochs (10 000)
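A minimal sketch of such a network, assuming plain batch backpropagation with sigmoid units: the hidden-layer size range and 10 000-epoch cap follow the poster, while the learning rate, weight initialization, and patience are illustrative assumptions. A production version would also restore the best-so-far weights when stopping.

```python
# One-hidden-layer network trained with backpropagation and early stopping
# on a validation set. Hedged sketch, not LINGER's actual implementation.
import numpy as np

def train_nn(X, y, X_val, y_val, hidden=20, lr=0.5,
             max_epochs=10_000, patience=200):
    rng = np.random.default_rng(0)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    bias = lambda A: np.hstack([A, np.ones((A.shape[0], 1))])
    W1 = rng.normal(scale=0.5, size=(X.shape[1] + 1, hidden))
    W2 = rng.normal(scale=0.5, size=(hidden + 1, y.shape[1]))
    predict = lambda A: sig(bias(sig(bias(A) @ W1)) @ W2)
    best_err, wait = np.inf, 0
    for _ in range(max_epochs):
        h = sig(bias(X) @ W1)
        out = sig(bias(h) @ W2)
        d_out = (out - y) * out * (1 - out)       # squared-error gradient
        d_h = (d_out @ W2[:-1].T) * h * (1 - h)   # skip the bias row of W2
        W1 -= lr * bias(X).T @ d_h
        W2 -= lr * bias(h).T @ d_out
        val_err = np.mean((predict(X_val) - y_val) ** 2)
        if val_err < best_err:
            best_err, wait = val_err, 0
        else:
            wait += 1
            if wait >= patience:                  # early stopping
                break
    return predict, best_err

# Toy stand-in for folder classification: learn logical OR (one output unit)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([[0], [1], [1], [1]], float)
predict, err = train_nn(X, y, X, y, hidden=4)
```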
Pre-processing for word extraction
- E-mail fields used: body, sender (From, Reply-to), recipient (To, CC, Bcc) and Subject (attachments seen as part of the body)
- These fields are treated equally and a single bag-of-words representation is created for each e-mail
- No stemming or stop wording was applied
- Discarding words that appear only once in a corpus
- Removing words longer than 20 characters from the body
- # unique words in a corpus reduced from 9000 to 1000
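The extraction and pruning steps above can be sketched as follows. The tokenizer regex is an assumption, and for brevity the 20-character limit is applied corpus-wide rather than to body words only, as the poster specifies.

```python
# Sketch of the word-extraction pre-processing: pool all fields, then drop
# words occurring only once in the corpus and over-long tokens.
import re
from collections import Counter

def extract_words(email):
    # All fields (body, sender, recipients, subject) are pooled and
    # treated equally, matching the single bag-of-words design.
    text = " ".join(email.values()).lower()
    return re.findall(r"[a-z0-9'@.\-]+", text)

def build_vocabulary(emails):
    tokenized = [extract_words(e) for e in emails]
    counts = Counter(w for words in tokenized for w in words)
    # Discard words that occur only once, and tokens longer than 20 chars
    return {w for w, c in counts.items() if c > 1 and len(w) <= 20}
```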
Corpora
Filing into folders Spam e-mail filtering
4 versions of PU1 and LingSpam depending on whether stemming and stop word list were used (bare, lemm, stop and lemm_stop)
Results
Performance measures: accuracy (A), recall (R), precision (P) and F1 measure
Stratified 10-fold cross validation
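Stratified splitting keeps the class proportions of the full corpus in every fold. A minimal sketch (real evaluations would also shuffle within each class first):

```python
# Stratified k-fold splitting: deal the indices of each class round-robin
# so every fold preserves the overall class proportions. Illustrative sketch.
from collections import defaultdict

def stratified_kfold(labels, k=10):
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)   # round-robin within each class
    return folds
```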
Filing Into Folders
Overall Performance
The simpler feature selector V is more effective than IG
U2 and U4 were harder to classify than U1, U3 and U5:
- different classification styles:
U1, U3, U5 - based on the topic and sender, U2 - on the action performed (e.g. Read&Keep), U4 - on the topic, sender, action performed and also when e-mails needed to be acted upon (e.g. ThisWeek)
- large ratio of the number of folders to the number of e-mails for U2 and U4
Comparison with Other Classifiers
Effect of Normalization and Weighting
Accuracy [%] for various normalizations (e – e-mail, m – mailbox, g – global) and weightings (freq. – frequency, tf-idf, and boolean)
Best results: mailbox level normalization, tf-idf and frequency weighting
Confusion matrix for folder ci:

                      # assigned to folder ci   # not assigned to folder ci
# from folder ci              tp                           fn
# not from folder ci          fp                           tn

A = (tp + tn) / (tp + fn + fp + tn)
P = tp / (tp + fp)
R = tp / (tp + fn)
F1 = 2PR / (P + R)
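The four per-folder measures computed directly from confusion-matrix counts:

```python
# Accuracy, precision, recall and F1 from the tp/fn/fp/tn counts above.
def metrics(tp, fn, fp, tn):
    a = (tp + tn) / (tp + fn + fp + tn)   # accuracy
    p = tp / (tp + fp)                    # precision
    r = tp / (tp + fn)                    # recall
    f1 = 2 * p * r / (p + r)              # F1 measure
    return a, p, r, f1
```

For example, the counts in confusion matrix a) of the portability experiments (tp=81, fn=1, fp=177, tn=23) reproduce the experiment-a row of the portability table up to rounding.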
user  feature selection  A [%]  R [%]  P [%]  F1 [%]  epochs
U1    V                  86.23  67.28  71.02  69.10   2150
U1    IG                 88.08  77.77  81.12  79.44   7960
U2    V                  68.54  41.27  45.11  43.10   702
U2    IG                 61.46  37.07  41.94  39.36   1738
U3    V                  92.34  76.18  77.45  76.81   3898
U3    IG                 74.55  45.02  47.41  46.18   5620
U4    V                  79.48  40.39  39.46  39.92   4616
U4    IG                 65.23  24.97  22.51  23.68   4250
U5    V                  83.40  81.66  84.75  83.18   4220
U5    IG                 70.16  64.79  71.32  67.90   9032
Spam Filtering
Cost-Sensitive Performance Measures
Blocking a legitimate message is λ times more costly than not blocking a spam message
Weighted accuracy (WA): when a legitimate e-mail is misclassified/correctly classified, this counts as λ errors/successes
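The weighted accuracy can be written as a small function; the formula below follows the standard cost-sensitive definition where each legitimate e-mail counts λ times.

```python
# Weighted accuracy: WA = (lam * n_legit_kept + n_spam_caught)
#                         / (lam * N_legit + N_spam)
def weighted_accuracy(tp, fn, fp, tn, lam):
    # tp/fn: spam caught/missed; fp/tn: legitimate blocked/kept.
    # Each legitimate e-mail counts lam times, so blocking one costs lam errors.
    return (lam * tn + tp) / (lam * (tn + fp) + tp + fn)
```

With lam=1 this reduces to plain accuracy; large lam (9 or 999) heavily penalizes blocking legitimate mail.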
Overall Performance
Performance on spam filtering for lemm corpora
3 scenarios: λ=1, no cost (flagging spam e-mail); λ=9, moderately accurate filter (notifying senders about blocked e-mails); and λ=999, highly accurate filter (completely automatic scenario)
LingerIG – perfect results on both PU1 and LingSpam for all λ; LingerV outperformed only by stumps and boosted trees
Effect of Stemming and Stop Word List – they do not help
Performance on the 4 different versions of LingSpam
Anti-Spam Filter Portability Across Corpora
a) and b) – low SP; many fp (non-spam as spam)
- Reason: different nature of legitimate e-mail in LingSpam (linguistics related) and U5Spam (more diverse)
- Features selected from LingSpam are too specific, not a good predictor for U5Spam (a)
b) – low SR as well; many fn (spam as non-spam)
- Reason: U5Spam is considerably smaller than LingSpam
Typical confusion matrices: a) and b)
c) and d) – good results
- Not perfect feature selection, but the NN is able to recover by training
More extensive experiments with diverse, non-topic-specific corpora are needed to determine the portability of anti-spam filters across different users
learner       U1     U2     U3     U4
kNN1          50.60  40.60  73.00  34.00
kNN5          71.90  24.70  59.60  48.40
kNN30         71.50  47.50  46.80  36.40
SVM1          79.40  40.60  80.70  58.50
SVM2          75.80  37.90  75.60  51.60
DT (C4.5)     71.30  43.40  78.20  57.30
NB            69.70  38.70  47.10  24.10
Perceptron    72.00  39.70  77.30  59.30
Widrow-Hoff   80.60  48.50  82.40  60.90
Filing into folders:
corpus  # e-mails  # folders
U1      545        7
U2      423        6
U3      888        11
U4      926        19
U5      982        6

Spam filtering:
corpus    # e-mails  # spam  # legit.
PU1       1099       481     618
LingSpam  2893       481     2412
U5Spam    282        82      200
[Chart: accuracy (0-100%) per user (User1-User5) for each normalization/weighting combination: e+freq., e+tf-idf, m+freq., m+tf-idf, g+freq., g+tf-idf, boolean]
                       λ=1                              λ=9       λ=999
corpus    filter    A [%]  SR [%]  SP [%]  SF1 [%]   WA [%]    WA [%]
LingSpam  LingerV   98.20  93.56   95.62   94.58     99.01     99.13
          LingerIG  100    100     100     100       100       100
          NB        96.93  82.40   99.00   89.94     99.43     99.99
          k-NN      N/Av   88.60   97.40   92.79     99.40     N/Av
          Stacking  N/Av   91.70   96.50   93.93     99.46     N/Av
          Stumps    N/Av   97.92   98.33   98.12     99.76     99.95
          TrBoost   N/Av   97.30   98.53   97.91     99.77     99.99
PU1       LingerV   93.45  88.36   96.46   92.23     96.69     97.41
          LingerIG  100    100     100     100       100       100
          NB        N/Av   83.98   95.11   89.19     96.38     99.47
          DT        N/Av   89.81   88.71   89.25     N/Av      N/Av
          Stumps    N/Av   96.47   97.48   96.97     98.58     99.66
          TrBoost   N/Av   96.88   98.52   97.69     99.14     99.98
[Chart: A, SR, SP and SF1 (0-100%) for the bare, lemm, stop and lemm_stop versions of LingSpam]
              features  train     test      A     SR    SP    SF1
Experiment 1
a             LingSpam  LingSpam  U5Spam    36.8  99.0  31.4  47.7
b             U5Spam    U5Spam    LingSpam  60.6  29.5  15.1  20.0
Experiment 2
c             LingSpam  U5Spam    U5Spam    87.9  62.2  94.4  75.0
d             U5Spam    LingSpam  LingSpam  98.8  94.6  98.1  96.3
a)
# assigned as:   spam  not spam
spam             81    1
not spam         177   23

b)
# assigned as:   spam  not spam
spam             159   322
not spam         832   1580