9/2003 1
Classifying and Filtering Spam
Using Search Engines
Oleg Kolesnikov
College of Computing
Georgia Tech
>50% of all e-mail today is spam?
Source: brightmail.com
Scale
• IDC: of the 31bn messages sent each day, 18%, or 5.6bn, were spam messages
• Brightmail decoy network stats:
6.7bn spam messages sent in March 2003, varying from 100 to ~100,000 identical e-mails sent at a time
Current techniques to deal with SPAM/UCE:
• Blacklisting
• Signature-based Filtering
• Statistical/Bayesian Filtering
• Heuristic Filtering
• Challenge-Response Filtering
• Sender-pays
• Laws
Blacklisting
• MAPS (Mail Abuse Prevention System) RBL catches only 24% of spam with 34% false positives (the spam police article, gaudi/gaspar)
• Self-appointed sheriffs/vigilantes; legitimate businesses increasingly caught in the crossfire, e.g. iBill was losing $100k/day during each of the four days of blacklisting
• Only a first cut at the problem; blacklists never cover more than 50% of the servers sending spam (Graham)
Sample and Signature-based Filtering
• Set up a network of DECOY e-mail addresses. Any message sent to these addresses must be spam => if the same message is sent to a protected address, that message must be SPAM, too (that's what Brightmail does)
• Not very flexible -- spammers take the lead in coming up with tricks
• Make each spam different
Brightmail (used by MS/Hotmail, Earthlink, Verizon, ebay etc. )
Basic Statistical Filtering
• Weakness: must be TRAINED; Strength: relatively low false positives
• Starts with two message corpora -- spam and legitimate
• Splits messages into TOKENs
• Assigns each token a probability, based on how often it appears in the spam corpus
e.g. 'naked' may have a 67% probability of appearing in spam vs. 'regards' at 10%
• When a new message arrives, the statistical filter takes the top N tokens whose probabilities are farthest from the neutral 0.5 in either direction, applies Bayes' theorem, and produces a RANKING for the e-mail
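The ranking step can be sketched as follows -- a hypothetical, minimal Graham-style combination assuming independent tokens; the token probabilities are made-up illustrations, not trained values:

```python
# Illustrative token probabilities (a real filter learns these from corpora).
token_prob = {'naked': 0.67, 'regards': 0.10, 'viagra': 0.99, 'meeting': 0.05}

def rank(tokens, n=15):
    """Combine the n most 'extreme' token probabilities via Bayes' theorem."""
    known = [token_prob[t] for t in tokens if t in token_prob]
    # Keep the tokens whose probability is farthest from the neutral 0.5.
    extreme = sorted(known, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    if not extreme:
        return 0.5  # no evidence either way
    prod, inv = 1.0, 1.0
    for p in extreme:
        prod *= p        # product of P(spam | token)
        inv *= 1.0 - p   # product of P(ham | token)
    return prod / (prod + inv)
```

A message scoring near 1.0 is ranked as spam, near 0.0 as legitimate.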
Heuristic Filtering
• What kind of filters can you come up with JUST BY LOOKING at a spam e-mail?
• Sender name looks bogus?
• Header fields are missing?
• Lots of HTML?
• Take all these rules and heuristic observations, assign weights/points, and put them into a database
• You've got yourself an early version of SPAMASSASSIN
SpamAssassin
• The way you can make it work (let’s say with postfix):
1) perl -MCPAN -e 'install Mail::SpamAssassin'
2) learn on database of spam and legitimate e-mails using sa-learn (part of spamassassin)
3) add a filter program to filter all incoming mail through spamc, a part of spamassassin:
/usr/bin/spamc | /usr/sbin/sendmail -i "$@"; exit $?
4) spamc adds headers, something like:
X-Spam-Flag: {YES|NO}, X-Spam-Level: ***
5) The headers are caught by a user’s procmail recipe and mail is classified appropriately
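The procmail recipe in step 5 might look like this (a minimal sketch; the spam-can folder name is illustrative):

```
:0:
* ^X-Spam-Flag: YES
spam-can
```

Any message whose X-Spam-Flag header is YES is locked and delivered to the spam-can folder; everything else falls through to the inbox.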
Heuristic Filtering Two
• Weakness: the public heuristic rules database makes it relatively easy for spammers to come up with ways to bypass the system => the rules database needs to be updated frequently
• May not be as effective today as other methods, such as stat filtering
Challenge-Response Filtering
• Whenever you receive an e-mail from someone NOT on your whitelist, an automatic reply is sent telling what steps the sender should take to be considered for the whitelist (e.g. send you a confirmation, make a donation, solve a puzzle, etc.)
• Very effective at stopping spam BUT has a number of drawbacks: valid mail is delayed; it is kind of harsh -- some senders may find it inconsiderate and never reply; extra work for senders, etc.
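The whitelist-plus-challenge logic can be sketched as follows (all names and the token scheme are illustrative, not a real implementation):

```python
# Per-user whitelist and a queue of messages held pending confirmation.
whitelist = {'alice@example.edu'}
pending = {}  # challenge token -> (sender, message)

def receive(sender, message, send_challenge):
    """Deliver whitelisted mail; hold the rest and challenge the sender."""
    if sender in whitelist:
        return 'deliver'
    token = str(hash((sender, message)))
    pending[token] = (sender, message)
    send_challenge(sender, token)  # e.g. "reply quoting this token"
    return 'held'

def confirm(token):
    """Sender answered the challenge: whitelist them and release the mail."""
    if token in pending:
        sender, message = pending.pop(token)
        whitelist.add(sender)
        return 'deliver'
    return 'ignore'
```

Spam bots rarely answer challenges, so held messages from them simply expire.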
Stats for different approaches (MessageLabs)
                  MAPS/RBL   Sample/Signature   Statistical   Heuristic/Rule-based
False negatives   40-100%    20%                ~1%*          5%
False positives   10%        2%                 0.1%*         0.5%
* See next slide
Problems with Statistical and other keyword-dependent methods
• 1) Heavily dependent on effective parsing and the presence of “true” tokens, e.g. spammers fooling parsers:
Examples:
– White background:
<font color=white>research data and other statistically strong keywords that are present in legitimate e-mails</font>
– Splitting words:
ch<!-- valid -->eck this p<!-- news -->orn
– Adding extra characters and spaces to confuse parsers (F*R E-E)
and so forth (JavaScript, fake HTML tags, browser-specific tricks)
• 2) Spam may contain too little text and be TOO close to real e-mails in keywords. This is a more serious problem. I’ll give an example later.
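A parser can counter the first class of tricks with simple normalization before tokenizing. A minimal sketch (the normalize helper is hypothetical, not part of any filter named above):

```python
import re

def normalize(text):
    # Remove HTML comments spammers insert mid-word ("ch<!-- valid -->eck").
    text = re.sub(r'<!--.*?-->', '', text, flags=re.S)
    # Strip remaining tags, including invisible white-on-white font tags.
    text = re.sub(r'<[^>]+>', ' ', text)
    # Collapse whitespace so tokens split cleanly.
    return ' '.join(text.split())
```

After normalization, the split words rejoin and the hidden "statistically strong" keywords lose their markup, so token statistics apply to what the user actually sees. The second problem -- spam that genuinely resembles legitimate mail -- is not fixable by parsing.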
My research
• Developed and implemented a system for filtering unwanted mail using Google
• Can be used WITHOUT training
Classification of current spam
Thoughts
• Some users must click on those ads or else there would be no spam (somebody IS interested in it after all)
• There may be more such users in the future as new regulations appear and spam becomes less of an annoyance and more of an ad
• Some users may like to receive SPAM-looking messages, for instance, marketing reports, offers, etc., that look very much like spam
Two main observations I use
• Spam is USER-SPECIFIC
• Most spammers expect users to TAKE some ACTION upon reading spam; in other words, there has to be a FEEDBACK mechanism
Targeting the feedback mechanism
• How effective would spam be without an easy feedback mechanism?
URLs as a feedback mechanism
• Of the ~1800 spam messages in the classical spam corpora I analyzed, ~95% contained URLs
• Of the remaining 5%, approximately 1/2 seemed to be damaged submissions (i.e. MIME conversion and other types of errors), the rest consisted of two types of letters:
– Messages with 1-800 numbers and faxes (including Nigerian scam)
– Religious letters
Basic Approach: URLSP
• The basic approach was to extract URLs, apply a user-specific whitelist based on a user’s mailbox (masks such as .edu, cnn.com etc.) and classify everything else as spam
• The first version I implemented has been in use at Tech since December '02
• Has actually been working quite well
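The basic URLSP pass might be sketched as follows -- a minimal illustration of the idea, not the deployed code; the whitelist masks and function names are assumptions:

```python
import re

# User-specific whitelist masks derived from the user's mailbox (illustrative).
WHITELIST = ['.edu', '.gov', 'cnn.com']

URL_RE = re.compile(r'https?://([^/\s]+)', re.I)

def classify(message):
    """Extract URL hosts; mail whose hosts all match a whitelist mask is ham."""
    hosts = URL_RE.findall(message)
    if not hosts:
        return 'unknown'  # no feedback URL -- the remaining ~5% of messages
    if all(any(h.endswith(mask) for mask in WHITELIST) for h in hosts):
        return 'ham'
    return 'spam'
```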
Effective but rather naive
• First version effective but rather naive
• Granularity and false positives can be a problem
Next version: Classifying URLs
• CLASSIFY URLs using Google and Open Directory
• Use whitelists/blacklists of categories and URLs BASED on user mailbox and individual preferences
DMOZ/ODP
Example
• Based on files automatically generated from your mailbox, configure the system as follows (blacklist* files are omitted):

whitelist.url:
  .edu, .mil, .gov, www.nmap.com, www.epic.org, www.cypherpunks.to, etc.

whitelist.cat:
  Top/Computers/Security/Anti_Virus/Products
  Top/Computers/Security/Products_and_Tools/Cryptography/PGP
  Top/Computers/Security/Products_and_Tools/Password_Tools
  ...
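The category check could be sketched like this (the ODP/Google lookup itself is stubbed out; the function and variable names are illustrative):

```python
# Whitelisted ODP category prefixes, as in the whitelist.cat example above.
WHITELIST_CATS = [
    'Top/Computers/Security/Anti_Virus/Products',
    'Top/Computers/Security/Products_and_Tools/Cryptography/PGP',
    'Top/Computers/Security/Products_and_Tools/Password_Tools',
]

def category_ok(url_categories):
    """True if any category looked up for a URL falls under a whitelisted one."""
    return any(cat.startswith(w)
               for cat in url_categories
               for w in WHITELIST_CATS)
```

A URL whose categories all fall outside the whitelist (or match the blacklist) pushes the message toward the spam-can.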
URL Classifier: Categories Extracted from SPAM
• Examples of categories of URLs extracted from spam:
Top/Business/Consumer_Goods_and_Services/Beauty/Cosmetics
Top/Business/Employment/Careers
Top/Business/Financial_Services/Mortgages
Top/Business/Investing/Day_Trading/Brokerages
Top/Business/Investing/Day_Trading/Education_and_Training
Top/Business/Investing/News_and_Media/Newsletters/Stocks_and_Bonds
Top/Business/Marketing_and_Advertising/Direct_Marketing/Mailing_Lists/MLM
Top/Regional/North_America/Canada/Business_and_Economy/Employment/Job_Search
Top/Shopping/Gifts/Personalized
Top/Shopping/Home_and_Garden/Kitchen_and_Dining/Appliances/Parts
...
GTUC v1.0 (Basic)
• Register for a free account on a CoC-based filtering server
• Forward your mail to the server
• The mail will be automatically classified into three folders as it arrives: Inbox, Unknown, spam-can
• Read your mail with IMAP
Spam of the future
• Innovative feedback mechanisms
• Appearance as close to legitimate e-mail as possible, e.g.:
From: [email protected]
Hi, here is an interesting article. You should check it out -- net::“terminator_25”
Roberto Carlos
Solution
• Current best: a combination of approaches
• Categorization and URL-based filtering can help
• Uncategorized URLs? Similarity matching, plus retrieval of the HTML and categorization with token stats/heuristics