www.sophos.com
Adaptive Filtering: One Year On
John Graham-Cumming
Research Director, Sophos’s Anti-Spam Task Force
Author, POPFile
Adaptive Filtering
Definition: An email filter that can be taught to recognize different types of mail without writing rules.
Most use some machine learning technique:
  Naïve Bayesian Classification [1]
  k-nearest neighbors (kNN) [2]
  Support Vector Machines [3]
All provide some measure of “spamminess”
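As a rough illustration of how such a filter might turn training into a “spamminess” number, here is a minimal naïve Bayes sketch. The class name `NaiveBayesFilter`, the whitespace tokenizer and the add-one smoothing are assumptions made for this example; they are not taken from POPFile or any other product named in these slides.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy two-class naive Bayes text classifier (illustrative only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Teach the filter one message; label is 'spam' or 'ham'."""
        self.counts[label].update(text.lower().split())
        self.messages[label] += 1

    def spamminess(self, text):
        """Return an estimate of P(spam | text) between 0 and 1."""
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        total_msgs = sum(self.messages.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Smoothed class prior.
            log_p = math.log((self.messages[label] + 1) / (total_msgs + 2))
            # Add-one smoothing, with one extra slot for unseen words.
            denom = sum(self.counts[label].values()) + len(vocab) + 1
            for word in text.lower().split():
                log_p += math.log((self.counts[label][word] + 1) / denom)
            log_score[label] = log_p
        # Normalize the two log scores into a probability, avoiding overflow.
        diff = max(-700.0, min(700.0, log_score["ham"] - log_score["spam"]))
        return 1.0 / (1.0 + math.exp(diff))


f = NaiveBayesFilter()
f.train("cheap viagra buy now", "spam")
f.train("dinner with mom last night", "ham")
print(f.spamminess("buy viagra"))        # above 0.5
print(f.spamminess("dinner with mom"))   # below 0.5
```

No rules are written anywhere: the only inputs are example messages and their labels.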
Machine Learning & Anti-spam
A little more than one year
Papers
  Mar 1998: SpamCop: A Spam Classification & Organization Program [1]
  Jul 1998: A Bayesian Approach to Filtering Junk E-mail [2]
  2000: An evaluation of Naive Bayesian anti-spam filtering [3]
  Aug 2002: A Plan for Spam [4]
Patents
  Jun 1998: 6,161,130: Technique which utilizes a probabilistic classifier to detect "junk" e-mail by automatically updating a training and re-training the classifier based on the updated training set
  Jun 1999: 6,592,627: System and method for organizing repositories of semi-structured documents such as email
Why now?
The “Grandma Problem”
Confluence of events:
  Spam getting close to 50% of all mail [1]
  Email reaching 1/3 of adults in US [2]
  Fast processors can handle the processing load
  No other good alternatives
    Laws?
    Migrate from SMTP? [3]
Two Routes
Open Source
  Lots of open source anti-spam solutions
  Many are “wannabe” solutions that simply implemented Paul Graham’s ideas
  Some are interesting tools (bogofilter, POPFile, SpamBayes)
Commercial
  Vendors now incorporating Adaptive Filtering into their anti-spam products
Classic tradeoff:
  Free, open source, community supported
  Fee, “productized”, vendor supported
Practical Open Source Filters
General mail filters [1]
  Aug 1996: ifile
  Aug 2002: POPFile
  Oct 2002: dbacl
Spam Filters [2]
Bogofilter, SpamBayes, Bayesian Spam Filter, SpamProbe, SpamWizard, BSpam, The Spam Secretary, Expaminator, SqueakyMail, Bayespam, spaminator, Quick Spam Filter, Annoyance Filter, DSPAM, PASP, Spam Blocker, CRM114
SpamAssassin (added Bayesian in 2.5)
Mainstream Adaptive Filtering
General
  SwiftFile (for Lotus Notes) [1]
  Ella Pro (for Microsoft Outlook) [2]
Anti-spam Desktop
  Mozilla 1.3, Eudora 6.0
  Microsoft MSN 8, Microsoft Outlook 2003
  AOL 9.0, Apple Mail.app (Jaguar)
Anti-spam Gateway
  Sophos PureMessage 4.x
Prediction: By end of 2004 every major email client includes adaptive filtering
The Problems
Man-in-the-street Usability
False Positives
Over training
One man’s spam is another man’s ham
Internationalization
Usability
Proxy, plug-in and external filters are too complex
General user needs:
  Not to have to understand the underlying mechanism
  Complete integration with mail client
  Obvious operation (e.g. spam is moved into a folder called Spam)
  Automatic whitelisting (if I send to Mom, Mom is ok)
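The automatic whitelisting idea can be sketched in a few lines. The class `AutoWhitelist` and its method names are hypothetical, invented for this example; a real mail client would hook the equivalent logic into its send path, and since From headers can be forged a whitelist hit should probably be only one signal among several.

```python
from email.utils import getaddresses, parseaddr


class AutoWhitelist:
    """Remember everyone the user has written to (illustrative sketch)."""

    def __init__(self):
        self.trusted = set()

    def record_outgoing(self, to_header):
        """Add every address found in an outgoing To/Cc header."""
        for _name, addr in getaddresses([to_header]):
            if addr:
                self.trusted.add(addr.lower())

    def is_trusted(self, from_header):
        """True if the sender of an incoming message was written to before."""
        _name, addr = parseaddr(from_header)
        return addr.lower() in self.trusted


wl = AutoWhitelist()
wl.record_outgoing("Mom <mom@example.com>")
print(wl.is_trusted("Mom <mom@example.com>"))   # True
print(wl.is_trusted("someone@example.net"))     # False
```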
False Positives
False Positive == Good mail identified as bad
False Negative == Spam identified as good
People tolerate false negatives, but hate false positives
Spam filters must guard against false positives:
  Bias towards False Negatives (“A Plan for Spam”)
  Cross check results (SpamBayes)
  High spam threshold
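Two of these guards can be sketched together: counting ham occurrences twice when estimating a token's spam probability, and demanding a high combined score before calling a message spam. The sketch is loosely modelled on the description in “A Plan for Spam”; the function names, the minimum count of 5, the clamping to 0.01 and 0.99, and the 0.9 threshold are assumptions of this example, and both corpus sizes are assumed to be non-zero.

```python
def token_spam_probability(word, spam_counts, ham_counts, n_spam, n_ham):
    """Per-token spam probability with ham occurrences counted twice."""
    g = 2 * ham_counts.get(word, 0)   # doubling ham biases toward "not spam"
    b = spam_counts.get(word, 0)
    if g + b < 5:
        return None                   # too rare to be trusted
    ham_rate = min(1.0, g / n_ham)
    spam_rate = min(1.0, b / n_spam)
    p = spam_rate / (ham_rate + spam_rate)
    return max(0.01, min(0.99, p))    # never fully certain either way


SPAM_THRESHOLD = 0.9  # assumed value: only confident verdicts count as spam


def is_spam(score, threshold=SPAM_THRESHOLD):
    """Apply the high threshold to a combined message score."""
    return score > threshold


spam_counts = {"viagra": 10, "dinner": 5}
ham_counts = {"dinner": 5}
print(token_spam_probability("viagra", spam_counts, ham_counts, 100, 100))
print(token_spam_probability("dinner", spam_counts, ham_counts, 100, 100))
```

With equal raw counts in both corpora, "dinner" scores about 0.33 rather than 0.5, so ambiguous words lean towards ham.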
Over Training
Occurs when user loads up adaptive filter with lots more spam than ham
  e.g. feeds entire spam archive into filter
Some adaptive filters then think everything is spam
For Naïve Bayes classifiers the “train on errors” methodology works well in practice
  User teaches filter only on mails it incorrectly classified
  “No, that’s spam” or “No, that’s ham” button
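A “train on errors” loop can be sketched as follows. `TinyFilter` is a deliberately crude, hypothetical stand-in for a real adaptive filter (it simply votes by word counts); the point of the example is the loop in `train_on_errors`, which trains only when the filter's verdict disagrees with the user's.

```python
from collections import Counter


class TinyFilter:
    """Hypothetical stand-in for an adaptive filter: votes by word counts."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}

    def train(self, text, label):
        self.words[label].update(text.lower().split())

    def classify(self, text):
        tokens = text.lower().split()
        spam = sum(self.words["spam"][t] for t in tokens)
        ham = sum(self.words["ham"][t] for t in tokens)
        return "spam" if spam > ham else "ham"


def train_on_errors(flt, labelled_mail):
    """Train only on messages the filter got wrong; return how many."""
    corrections = 0
    for text, true_label in labelled_mail:
        if flt.classify(text) != true_label:
            flt.train(text, true_label)   # the user's "No, that's spam/ham" click
            corrections += 1
    return corrections


mail = [
    ("buy viagra now", "spam"),
    ("buy viagra cheap", "spam"),
    ("dinner with mom", "ham"),
    ("viagra cheap", "spam"),
]
flt = TinyFilter()
print(train_on_errors(flt, mail))   # number of corrections the user had to make
```

Because correctly classified mail is never fed back in, a large spam archive cannot swamp the ham statistics.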
One man’s spam…
Can be hard to unsubscribe from legitimate bulk mail
Users tell spam filter that legitimate mail is spam
  Creates false positives for other users in shared systems
  e.g. I say CNET News email is spam, you want it
Ideal system has two parts
  Gateway spam filter run by IT group
  Individual preferences on each client
Internationalization
Tokenization non-trivial for some languages
  In English words are “space separated”
  Thisisnotthecaseinsomeotherlanguages: Japanese (POPFile の特別な使い方)
Different punctuation
  ¿Español? «Français»
UTF-8, Unicode
  أخبار و تقارير looks like ÃÎÈÇÑ æ ÊÞÇÑíÑ
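Both problems are easy to reproduce. The sketch below shows whitespace tokenization failing on the Japanese example, one possible language-neutral fallback (character bigrams, an assumption of this example rather than something the slides prescribe), and the character-set mix-up behind the garbled Arabic. The garbled text on the slide appears consistent with Windows-1256 bytes being displayed as Latin-1, which is what the last lines assume.

```python
def naive_tokens(text):
    """Split on whitespace: adequate for English, not for unsegmented scripts."""
    return text.split()


def char_bigrams(text):
    """Overlapping two-character tokens, one language-neutral fallback."""
    compact = "".join(text.split())
    return [compact[i:i + 2] for i in range(len(compact) - 1)]


english = "It was really nice to have dinner"
japanese = "POPFile の特別な使い方"

print(len(naive_tokens(english)))    # one token per word
print(naive_tokens(japanese))        # the whole Japanese phrase is a single token
print(char_bigrams("の特別な使い方"))

# The Arabic words from the slide, written with escapes to avoid bidi confusion.
arabic = "\u0623\u062e\u0628\u0627\u0631 \u0648 \u062a\u0642\u0627\u0631\u064a\u0631"
# Assumption: the sender used Windows-1256 and the reader decoded as Latin-1.
garbled = arabic.encode("cp1256").decode("latin-1")
print(garbled)
```

A filter that does not decode the declared character set before tokenizing will learn the garbled form as if it were a different set of words.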
Spammer’s Response
Overwhelm filter with “good words”
Hide those good words from people
Use HTML as trickery toolbox
Three techniques:
  And the Kitchen Sink
  Invisible Ink
  Camouflage
More in Sophos’s Field Guide to Spam [1]
And the Kitchen Sink
Throw in innocent words before or after the HTML
<html><body>Viagra</body></html>
Hi, Johnny! It was really nice to have dinner with you last night. See you soon, love Mom
And the Kitchen Sink
Spammer hopes reader concentrates on the spam message part
Ineffective because user gets to see the innocent words
Spammers need ways to hide the innocent words
So they’ve taken inspiration from search engine trickery…
Invisible Ink
Use HTML font colors to write white on white
<body bgcolor=white>Viagra<font color=white>Hi, Johnny! It was really nice to have dinner with you last night. See you soon, love Mom</font></body>
Invisible Ink
Easily spotted if filter groks HTML
Can confuse filters that just drop HTML tags
Spammers have noticed that Invisible Ink is being targeted
They’ve adapted…
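A filter that parses the HTML rather than dropping tags can flag text whose font color equals the page background. The sketch below uses Python's standard `html.parser`; the class name `InvisibleInkDetector`, the assumed default background of white, and the exact string comparison of colors are simplifications made for this example (a real filter would need to normalize names such as `white` and `#FFFFFF`, and handle CSS).

```python
from html.parser import HTMLParser


class InvisibleInkDetector(HTMLParser):
    """Collect text whose <font color> equals the <body bgcolor>."""

    def __init__(self):
        super().__init__()
        self.background = "white"   # assumed default when no bgcolor is given
        self.color_stack = []
        self.hidden_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body" and attrs.get("bgcolor"):
            self.background = attrs["bgcolor"].lower()
        elif tag == "font":
            # Inherit the enclosing color when this tag sets none.
            inherited = self.color_stack[-1] if self.color_stack else None
            color = attrs.get("color")
            self.color_stack.append(color.lower() if color else inherited)

    def handle_endtag(self, tag):
        if tag == "font" and self.color_stack:
            self.color_stack.pop()

    def handle_data(self, data):
        current = self.color_stack[-1] if self.color_stack else None
        if current == self.background and data.strip():
            self.hidden_text.append(data.strip())


detector = InvisibleInkDetector()
detector.feed("<body bgcolor=white>Viagra"
              "<font color=white>Hi, Johnny!</font></body>")
print(detector.hidden_text)
```

An exact-match check like this catches white on white but says nothing about colors that are merely close.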
Camouflage
Use very similar HTML colors
<body bgcolor=#113333>
<font color=yellow>Viagra</font>
<font color=#123939>some innocent words</font>
</body>
Camouflage
Hard to see, but “some innocent words” do appear
Pythagoras Spots Spam
Foreground and background colors are coordinates in 3D
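Imagine a Red axis, a Green axis and a Blue axis
• Similar colors are close
• Dissimilar colors are far apart
• Pythagoras’ Theorem (3D) [1] gives the color distance
[Figure: RGB color space drawn with Red, Green and Blue axes from the origin (00,00,00); the points (11,33,33) and (12,39,39) lie close together and (FF,FF,00) lies far away. Speech bubble: “Sweet, I rule in 2003”]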
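The color distance for the Camouflage example can be worked through directly. The function names and the threshold of 30 are assumptions chosen for illustration; the slides do not state what cut-off a real filter uses.

```python
import math


def parse_hex_color(color):
    """Turn '#RRGGBB' into an (R, G, B) tuple of integers."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def color_distance(c1, c2):
    """Euclidean (3D Pythagoras) distance between two '#RRGGBB' colors."""
    p, q = parse_hex_color(c1), parse_hex_color(c2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


CAMOUFLAGE_THRESHOLD = 30  # hypothetical cut-off for "too close to read"

background = "#113333"
for foreground in ("#123939", "#FFFF00"):   # camouflaged text, yellow text
    d = color_distance(background, foreground)
    verdict = "camouflaged" if d < CAMOUFLAGE_THRESHOLD else "visible"
    print(foreground, round(d, 2), verdict)
```

The camouflaged color is about 8.5 units from the background (the square root of 1 + 36 + 36), while yellow is roughly 318 units away.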
Spammers love HTML
[Chart: Spams using HTML, % of messages per month: Dec-02 84%, Jan-03 83%, Feb-03 85%, Mar-03 84%, Apr-03 84%]
Trick Trends - Two Increasing
[Chart: Two Tricks Showing Gains. % of messages (axis 0% to 30%) per month, Dec-02 to Apr-03, for two series: HTML Comments and Invisible Ink]
Tricks Make Spam Spotting Easier
Bad news for spammers:
  The harder you try to obscure your messages the easier they are to filter
  Spam trickery becomes the spam fingerprint
Bad news for end users:
  Spammers will react by making spam more innocent
Hi, I saw your profile and wanted to get in touch, please check out my site at www.some-viagra-site.com
The Filter Paradox
Do filters make spam more effective?
One spammer claimed on Slashdot (/.):
“Your filters help cut down on the complaints to ISPs […] you no longer complain to [email protected], my access providers, or anyone else who might cause me problems”
Time will tell
The End
Following slides are for reference purposes
References
Slide 2
1. http://www.wikipedia.org/wiki/Naive_Bayesian_classification
2. http://www.usenix.org/events/sec02/full_papers/liao/liao_html/node4.html
3. http://citeseer.nj.nec.com/tong00support.html
References
Slide 3
1. http://citeseer.nj.nec.com/pantel98spamcop.html
2. http://citeseer.nj.nec.com/sahami98bayesian.html
3. http://citeseer.nj.nec.com/androutsopoulos00evaluation.html
4. http://www.paulgraham.com/spam.html
Slide 4
1. Wired, p50, September 2003 predicts 50% of all mail will be spam by September 2004
2. US Census Bureau, 2000
3. One proposal is AMTP: http://www.ietf.org/internet-drafts/draft-weinman-amtp-00.txt
References
Slide 5
1. POPFile: http://popfile.sourceforge.net
   ifile: http://www.nongnu.org/ifile/
2. Search SourceForge and Freshmeat
Slide 6
1. http://www.research.ibm.com/swiftfile/
2. http://www.openfieldsoftware.com/Ella.asp
References
Slide 17
1. http://www.activestate.com/Products/PureMessage/Field_Guide_to_Spam/
Pythagoras in 3D
Distance between two points in space
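Pythagoras: $\delta^2 = \alpha^2 + \beta^2$
Pythagoras: $\alpha^2 = (x-a)^2 + (z-c)^2$
$\beta^2 = (y-b)^2$
$\delta = \sqrt{(x-a)^2 + (y-b)^2 + (z-c)^2}$
[Figure: points (a, b, c) and (x, y, z) joined by a line of length δ, the hypotenuse of a right triangle with legs α and β]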