presented by Christian Loza, Srikanth Palla and Liqin Zhang SPAM Christian Loza Srikanth Palla Liqin Zhang


SPAM

Christian Loza, Srikanth Palla, Liqin Zhang


Overview

Introduction
Background
Measurement
Methods
Compare different methods
Conclusions


Introduction

If you use email, you have probably received spam recently: an unsolicited, unwanted message sent to you without your permission. Sending spam violates the Acceptable Use Policy (AUP) of almost all ISPs and can lead to the termination of the sender's account. Because the recipient directly bears the cost of delivery, storage, and processing, one could regard spam as the electronic equivalent of "postage-due" junk mail.


Introduction

Spammers frequently engage in deliberate fraud to send out their messages. Spammers often use false names, addresses, phone numbers, and other contact information to set up "disposable" accounts at various Internet service providers. They also often use falsified or stolen credit card numbers to pay for these accounts. This allows them to move quickly from one account to the next as the host ISPs discover and shut down each one.


Introduction

In recent years, spam has shown no sign of slowing its growth.

This is mainly because it works.

Its advantage for the sender is that it is a cheap way to increase the customer base.



Spammers frequently go to great lengths to conceal the origin of their messages. They do this by spoofing e-mail addresses: the spammer exploits the fact that plain SMTP does not verify the sender address, so a message can be made to appear to originate from another e-mail address. Some ISPs and domains require the use of SMTP AUTH, which allows positive identification of the specific account from which an e-mail originates.


One cannot completely spoof an e-mail's delivery chain, since the receiving mail server records the IP address of the server that actually connected to it; however, spammers can forge the rest of the history of mail servers the e-mail has ostensibly traversed. Spammers frequently seek out and make use of vulnerable third-party systems such as open mail relays and open proxy servers.


Address Collection

Spammers may harvest e-mail addresses from a number of sources. A popular method uses e-mail addresses which their owners have published for other purposes. Usenet posts, especially those in archives such as Google Groups, frequently yield addresses. Simply searching the Web for pages with addresses, such as corporate staff directories, can yield thousands of addresses, most of them deliverable.


Address Collection

Spammers have also subscribed to discussion mailing lists for the purpose of gathering the addresses of posters. The DNS and WHOIS systems require the publication of technical contact information for all Internet domains; spammers have illegally crawled these resources for e-mail addresses. Many spammers use programs called web spiders to find e-mail addresses on web pages. Because spammers offload the bulk of their costs onto others, however, they can use even more computationally expensive means to generate addresses.


Address Collection

A dictionary attack consists of an exhaustive attempt to gain access to a resource by trying all possible credentials, usually usernames and passwords. Spammers have applied this principle to guessing e-mail addresses, for example by taking common names and generating likely e-mail addresses for them at each of thousands of domain names.

Spammers sometimes use various means to confirm addresses as deliverable. For instance, including a Web bug in a spam message written in HTML may cause the recipient's mail client to transmit the recipient's address, or any other unique key, to the spammer's Web site.


Terminology

To better understand the concepts in this presentation, let us consider the following terminology.

Mail User Agent (MUA). This refers to the program used by the client to send and receive e-mail. It is usually referred to as the "mail client." Examples are Pine and Eudora.

Mail Transfer Agent (MTA). This refers to the program running on the server to store and forward e-mail messages. It is usually referred to as the "mail server program." Examples are sendmail and the Microsoft Exchange server.


The Mail Queue


In a normal configuration, sendmail sits in the background waiting for new messages. When a new connection arrives, a child process is invoked to handle the connection, while the parent process goes back to listening for new connections.

When a message is received, the sendmail child process puts it into the mail queue (usually stored in /var/spool/mqueue). If it is immediately deliverable, it is delivered and removed from the queue. If it is not immediately deliverable, it will be left in the queue and the process will terminate.

Messages left in the queue will stay there until the next time the queue is processed. The parent sendmail will usually fork a child process to attempt to deliver anything left in the queue at regular intervals.


Structure of E-mail Message

Email messages are composed of two parts:

1. Headers (lines of the form "field: value" which contain information about the message, such as "To:", "From:", "Date:", and "Message-ID:")

2. Body (the text of the message)


Example

From [email protected] Mon Jul 5 23:46:19 1999
Received: (from johndoe@localhost) by students.uiuc.edu (8.9.3/8.9.3) id LAA05394; Mon, 5 Jul 1999 23:46:18 -0500
Received: from staff.uiuc.edu (staff.uiuc.edu [128.174.5.59]) by students.uiuc.edu (8.9.3/8.9.3) id XAA24214; Mon, 5 Jul 1999 23:46:25 -0500
Date: Mon, 5 Jul 1999 23:46:18 -0500
From: John Doe <[email protected]>
To: John Smith <[email protected]>
Message-Id: <[email protected]>
Subject: This is a subject header.

This is the message body. It is separated from the headers by a blank line.

The message body can span multiple lines.
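The split between headers and body can be tried out with Python's standard `email` package. This is only a minimal sketch: the message text, the addresses and the helper name `split_message` are placeholders invented for illustration, not taken from the slides.

```python
from email import message_from_string

# Hypothetical message; addresses and host names are placeholders.
RAW_MESSAGE = (
    "Date: Mon, 5 Jul 1999 23:46:18 -0500\n"
    "From: John Doe <johndoe@example.org>\n"
    "To: John Smith <jsmith@example.net>\n"
    "Subject: This is a subject header.\n"
    "\n"
    "This is the message body.\n"
    "The message body can span multiple lines.\n"
)


def split_message(raw):
    """Return (headers, body); headers is a list of (field, value) pairs."""
    msg = message_from_string(raw)
    # Everything before the first blank line is parsed as headers,
    # everything after it is the body (payload).
    return msg.items(), msg.get_payload()


headers, body = split_message(RAW_MESSAGE)
for field, value in headers:
    print(f"{field}: {value}")
print("---")
print(body)
```

The parser relies on exactly the rule stated above: the first blank line ends the headers.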


Here is an example SMTP transaction:

1. Client connects to server's SMTP port (25).
2. Server: 220 staff.uiuc.edu ESMTP Sendmail 8.10.0/8.10.0 ready; Mon, 13 Mar 2000 14:54:08 -0600
3. Client: helo students.uiuc.edu
4. Server: 250 staff.uiuc.edu Hello [email protected] [128.174.5.62], pleased to meet you
5. Client: mail from: [email protected]
6. Server: 250 2.1.0 [email protected]... Sender ok
7. Client: rcpt to: [email protected]
8. Server: 250 2.1.5 [email protected]... Recipient ok
9. Client: data
10. Server: 354 Enter mail, end with "." on a line by itself
11. Client:

    Received: (from johndoe@localhost) by students.uiuc.edu (8.9.3/8.9.3) id LAA05394; Mon, 5 Jul 1999 23:46:18 -0500
    Date: Mon, 5 Jul 1999 23:46:18 -0500
    From: John Doe <[email protected]>
    To: John Smith <[email protected]>
    Message-Id: <[email protected]>
    Subject: This is a subject header.

    This is the message body. It is separated from the headers by a blank line.
    The message body can span multiple lines.

12. Server: 250 2.0.0 e2DKuDw34528 Message accepted for delivery
13. Client: quit
14. Server: 221 2.0.0 staff.uiuc.edu closing connection

The sender and recipient addresses used in the SMTP transaction are called the Message Envelope. Note that these addresses do not need to have any similarity to the addresses in the message headers!
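To make the difference between envelope and headers concrete, the sketch below only builds the client-side lines of a transaction as strings; it does not open a network connection. The function name `smtp_client_commands` and all addresses and host names are hypothetical.

```python
def smtp_client_commands(helo_host, envelope_from, envelope_to, message_text):
    """Return the lines a client sends during one SMTP transaction, in order."""
    return [
        f"HELO {helo_host}",
        f"MAIL FROM:<{envelope_from}>",  # envelope sender
        f"RCPT TO:<{envelope_to}>",      # envelope recipient
        "DATA",
        message_text,                    # headers, blank line, body
        ".",                             # a lone dot ends the message
        "QUIT",
    ]


# Hypothetical message: its From header names a different address
# than the envelope sender used below.
MESSAGE = (
    "From: John Doe <johndoe@example.org>\n"
    "To: John Smith <jsmith@example.net>\n"
    "Subject: This is a subject header.\n"
    "\n"
    "This is the message body.\n"
)

commands = smtp_client_commands(
    "client.example.com", "bounces@example.com", "jsmith@example.net", MESSAGE
)
for line in commands:
    print(line)
```

Nothing in the exchange ties `MAIL FROM` to the `From:` header, which is what makes header spoofing possible. Python's `smtplib.SMTP.sendmail(from_addr, to_addrs, msg)` reflects the same split: the envelope addresses are passed separately from the message text.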


Delivering Spam messages

Early on, spammers discovered that if they sent large quantities of spam directly from their ISP accounts, recipients would complain and ISPs would shut their accounts down. Thus, one of the basic techniques of sending spam has become to send it from someone else's computer and network connection. By doing this, spammers protect themselves in several ways: they hide their tracks, get others' systems to do most of the work of delivering messages, and direct the efforts of investigators towards the other systems rather than the spammers themselves.


Mail filters

A mail filter is a piece of software that takes an email message as input. As output, it might pass the message through unchanged for delivery to the user's mailbox, it might redirect the message for delivery elsewhere, or it might even throw the message away. Some mail filters are also able to edit messages during processing.
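The three possible outcomes (deliver, redirect, discard) can be sketched as a small function. This is an illustrative toy, not a real filter: the function name `filter_message`, the blocked-sender set, the keyword rules and the mailbox names are all assumptions made for the example.

```python
from email import message_from_string


def filter_message(raw, blocked_senders, redirect_rules):
    """Return (action, destination) for one raw message.

    action is "discard", "redirect" or "deliver".
    """
    msg = message_from_string(raw)
    sender = msg.get("From", "")
    subject = msg.get("Subject", "").lower()
    # Throw the message away if it comes from a blocked sender.
    if any(blocked in sender for blocked in blocked_senders):
        return ("discard", None)
    # Redirect it if the subject contains one of the keywords.
    for keyword, mailbox in redirect_rules.items():
        if keyword in subject:
            return ("redirect", mailbox)
    # Otherwise pass it through unchanged.
    return ("deliver", "inbox")


RAW = "From: offers@example.com\nSubject: New mortgage rates\n\nClick here.\n"
print(filter_message(RAW, {"offers@example.com"}, {}))
print(filter_message(RAW, set(), {"mortgage": "junk"}))
print(filter_message(RAW, set(), {}))
```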


Introduction

Application of Text Categorization

Spam classification is defined as a binary problem: an email either is spam or is not spam.

Automatic text categorization assigns each email to one of these two categories, using different methods.

One of these methods is centroid-based classification.

[Figure: a sample message ("Hello, Hi, this is your opportunity to buy a house with new mortgage rates. To find more about this, just click here.") to be labeled SPAM or NOT SPAM]


Background

Text classification: classify documents into categories (spam, un-spam)

Classification process:
  Preprocess the message
    Remove tags
    Stop-word removal
    Word stemming
  Training: build the classification model
  Testing: evaluate the model


Methodologies

Naive Bayes
Centroid-based
Content-based


Bayesianism

Bayesianism is the philosophical tenet that the mathematical theory of probability applies to the degree of plausibility of a statement, that is, to the degree of belief a rational agent holds in the truth of that statement. When Bayes' theorem is used to update such a degree of belief, the process is called Bayesian inference.


Bayes' Rule

If A and B are two separate but possibly dependent random events, then:

The probability of A and B occurring together = Pr[A, B]

The conditional probability of A, given that B occurs = Pr[A | B]

The conditional probability of B, given that A occurs = Pr[B | A]


From the elementary rules of probability:

\[ \Pr[A, B] = \Pr[A \mid B]\,\Pr[B] = \Pr[B \mid A]\,\Pr[A] \]

Dividing the right-hand pair of expressions by Pr[B] gives Bayes' rule:

\[ \Pr[A \mid B] = \frac{\Pr[B \mid A]\,\Pr[A]}{\Pr[B]} \]


In problems of probabilistic inference, we are often trying to estimate the most probable underlying model for a random process, based on some observed data or evidence. If A represents a given set of model parameters, and B represents the set of observed data values, then the terms in Bayes' rule are given the following names:

➢ Pr[A] is the prior probability of the model A (in the absence of any evidence)
➢ Pr[B] is the probability of the evidence B
➢ Pr[B|A] is the likelihood that the evidence B was produced, given that the model was A
➢ Pr[A|B] is the posterior probability of the model being A, given that the evidence is B


Mathematically, Bayes' rule states:

\[ \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}} \]
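A small numeric sketch may help fix the roles of the four terms. The probabilities below are made up for illustration, and the marginal likelihood is expanded with the law of total probability over the two cases "A" and "not A"; the function name `posterior` is likewise hypothetical.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Pr[A|B] from Bayes' rule for a two-hypothesis problem (A or not A)."""
    # Marginal likelihood Pr[B], by the law of total probability.
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1.0 - prior_a))
    return likelihood_b_given_a * prior_a / evidence


# Made-up numbers: A = "message is spam", B = "message contains a given word".
PRIOR_SPAM = 0.4               # Pr[A]
WORD_GIVEN_SPAM = 0.5          # Pr[B|A]
WORD_GIVEN_NOT_SPAM = 0.05     # Pr[B|not A]

p = posterior(PRIOR_SPAM, WORD_GIVEN_SPAM, WORD_GIVEN_NOT_SPAM)
print(f"Pr[spam | word] = {p:.3f}")
```

With these numbers the evidence is 0.5 × 0.4 + 0.05 × 0.6 = 0.23, so the posterior is 0.20 / 0.23, roughly 0.87: seeing the word raises the belief that the message is spam from 0.4 to about 0.87.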


Representing E-mail for statistical Algorithms

All statistical algorithms for spam filtering begin with a vector representation of individual e-mail messages. The length of the term vector is the number of distinct words in all the e-mail messages in the training data. The entry for a particular word in the term vector for a particular e-mail message is usually the number of occurrences of the word in the e-mail message.


Training data comprising four labeled e-mail messages

The table below presents toy training data comprising four e-mail messages. These data contain ten distinct words: the, quick, brown, fox, rabbit, ran, and, run, at, and rest.

#   Message                        Spam
1   The quick brown fox            no
2   The quick rabbit ran and ran   yes
3   rabbit run run run             no
4   rabbit at rest                 yes


Term vectors corresponding to the training data

#   and  at  brown  fox  quick  rabbit  ran  rest  run  the
1   0    0   1      1    1      0       0    0     0    1
2   1    0   0      0    1      1       2    0     0    1
3   0    0   0      0    0      1       0    0     3    0
4   0    1   0      0    0      1       0    1     0    0
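The term vectors can be recomputed directly from the four toy messages. The following is a minimal sketch that assumes lower-casing and whitespace tokenization, which is enough for this toy data; the function name `term_vectors` is introduced here for illustration.

```python
from collections import Counter

MESSAGES = [
    "The quick brown fox",
    "The quick rabbit ran and ran",
    "rabbit run run run",
    "rabbit at rest",
]


def term_vectors(messages):
    """Return (vocabulary, vectors) with one count vector per message."""
    tokenized = [m.lower().split() for m in messages]
    # The vocabulary is every distinct word in the training data, sorted.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = term_vectors(MESSAGES)
print(vocabulary)
for number, vector in enumerate(vectors, start=1):
    print(number, vector)
```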


If the training data comprise thousands of e-mail messages, the number of distinct words often exceeds 10,000. Two simple strategies to reduce the size of the term vector somewhat are to remove “stop words” (words like and, of, the, etc.) and to reduce words to their root form, a process known as stemming (so, for example, “ran” and “run” reduce to “run”). Table 3 shows the reduced term vectors along with the spam label.


Term vectors after stemming and stop-word removal, spam label coded as 0 = no, 1 = yes

    X1     X2   X3     X4      X5    X6   Y
#   brown  fox  quick  rabbit  rest  run  Spam
1   1      1    1      0       0     0    0
2   0      0    1      1       0     2    1
3   0      0    0      1       0     3    0
4   0      0    0      1       1     0    1
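Stop-word removal and stemming can be sketched on the same toy messages. The stop-word set and the one-entry stemming table below are assumptions chosen to fit this example; a real system would use a proper stemmer such as Porter's algorithm.

```python
from collections import Counter

MESSAGES = [
    "The quick brown fox",
    "The quick rabbit ran and ran",
    "rabbit run run run",
    "rabbit at rest",
]
STOP_WORDS = {"the", "and", "at"}   # toy stop-word list
STEMS = {"ran": "run"}              # toy stemming table, not a real stemmer


def reduce_tokens(message):
    """Lower-case, drop stop words, then map each word to its root form."""
    tokens = message.lower().split()
    return [STEMS.get(t, t) for t in tokens if t not in STOP_WORDS]


reduced = [reduce_tokens(m) for m in MESSAGES]
vocabulary = sorted({t for tokens in reduced for t in tokens})
vectors = [[Counter(tokens)[word] for word in vocabulary] for tokens in reduced]

print(vocabulary)
for number, vector in enumerate(vectors, start=1):
    print(number, vector)
```

The vocabulary shrinks from ten words to six, and "ran" in message 2 is now counted under "run".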


Naive Bayes for Spam

Let X = (X1, ..., Xd) denote the term vector for a random e-mail message, where d is the number of distinct words in the training data, after stemming and stop-word removal. Let Y denote the corresponding spam label. The Naive Bayes model seeks to build a model for Pr(Y = 1 | X1 = x1, ..., Xd = xd).

From Bayes' theorem, we have:

\[ \Pr(Y = 1 \mid X_1 = x_1, \ldots, X_d = x_d) = \frac{\Pr(Y = 1)\,\Pr(X_1 = x_1, \ldots, X_d = x_d \mid Y = 1)}{\Pr(X_1 = x_1, \ldots, X_d = x_d)} \]
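The "naive" step is to assume that the words are conditionally independent given the label, so that the class-conditional term factorizes into one factor per word. The sketch below applies that idea to the reduced toy vectors. It is one possible variant, not necessarily the one the presenters had in mind: it assumes a multinomial word model with add-one (Laplace) smoothing, and the names `train`, `posterior_spam` and `alpha` are introduced here.

```python
import math

# Reduced toy data: columns are brown, fox, quick, rabbit, rest, run.
X = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 2],
    [0, 0, 0, 1, 0, 3],
    [0, 0, 0, 1, 1, 0],
]
Y = [0, 1, 0, 1]  # 0 = not spam, 1 = spam


def train(X, Y, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    d = len(X[0])
    model = {}
    for label in sorted(set(Y)):
        rows = [x for x, y in zip(X, Y) if y == label]
        totals = [sum(column) for column in zip(*rows)]
        denominator = sum(totals) + alpha * d  # add-alpha smoothing (assumption)
        model[label] = {
            "log_prior": math.log(len(rows) / len(X)),
            "log_word": [math.log((t + alpha) / denominator) for t in totals],
        }
    return model


def posterior_spam(model, x):
    """Pr(Y = 1 | x) under the naive (word-independence) assumption."""
    scores = {
        label: m["log_prior"] + sum(c * lw for c, lw in zip(x, m["log_word"]))
        for label, m in model.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    weights = {label: math.exp(s - top) for label, s in scores.items()}
    return weights[1] / sum(weights.values())


model = train(X, Y)
print(posterior_spam(model, [0, 0, 0, 1, 1, 0]))  # "rabbit rest"
print(posterior_spam(model, [1, 1, 0, 0, 0, 0]))  # "brown fox"
```

The denominator of Bayes' theorem never has to be computed separately: it is the sum of the per-class numerators, which is what the final normalization does.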


Centroid-based method

The documents are represented using a Vector-space model.

Each document is represented as a Term Frequency vector (TF)

[Figure: documents d1, d2, d3 and d4 drawn as vectors in a term space with axes t1 and t2]


Centroid-based method

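A refinement of this model is the inverse document frequency (IDF).

It limits the discriminating power of frequent terms and stop words, and emphasizes words that appear only in specific documents.

The IDF of term i is log(N/df_i), where N is the total number of documents and df_i is the number of documents that contain term i.

Each document vector is normalized so that the length of the document does not matter.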
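A minimal TF-IDF sketch on the reduced toy vectors is shown below. It assumes the natural logarithm and unit-length (L2) normalization; the slides do not say which base or norm was used, and the function name `tf_idf_vectors` is introduced here.

```python
import math


def tf_idf_vectors(term_counts):
    """Weight raw term counts by IDF and scale each document to unit length."""
    n_docs = len(term_counts)
    n_terms = len(term_counts[0])
    # df[i]: number of documents that contain term i at least once.
    df = [sum(1 for doc in term_counts if doc[i] > 0) for i in range(n_terms)]
    idf = [math.log(n_docs / df_i) if df_i else 0.0 for df_i in df]
    vectors = []
    for doc in term_counts:
        weighted = [tf * w for tf, w in zip(doc, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        vectors.append([v / norm for v in weighted] if norm else weighted)
    return vectors


# Columns: brown, fox, quick, rabbit, rest, run.
COUNTS = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 2],
    [0, 0, 0, 1, 0, 3],
    [0, 0, 0, 1, 1, 0],
]

vectors = tf_idf_vectors(COUNTS)
for vector in vectors:
    print([round(v, 3) for v in vector])
```

In message 4 ("rabbit rest"), "rabbit" occurs in three of the four documents and therefore ends up with a much smaller weight than "rest", which occurs in only one.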


Centroid-based method

The similarity between two vectors is measured with the cosine function.

Finally, one centroid vector C is defined for each category (spam / not spam) as the average of the document vectors in that category.


Centroid-based method

We can measure the similarity between one document and the Centroid of the category with the following function
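In the usual notation these quantities can be written as follows. This is a sketch under standard definitions, not a transcription of the slides: S denotes the set of training document vectors of one category, and the symbols d_i, d_j, S and sim are introduced here.

```latex
\cos(d_i, d_j) = \frac{d_i \cdot d_j}{\lVert d_i \rVert_2 \, \lVert d_j \rVert_2}

C = \frac{1}{|S|} \sum_{d \in S} d

\operatorname{sim}(d, C) = \cos(d, C) = \frac{d \cdot C}{\lVert d \rVert_2 \, \lVert C \rVert_2}
```

If the document vectors are already normalized to unit length, the factor \lVert d \rVert_2 equals 1 and only the norm of the centroid remains in the denominator.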


Steps: Centroid-based Method

1. TRAINING

Determine the document vectors using TF-IDF.

[Figure: training documents d1 to d8 plotted in the term space with axes t1 and t2]


Steps: Centroid-based Method

1. TRAINING

Calculate the centroid for the categories SPAM and NOT SPAM.

[Figure: documents d1 to d8 in the term space (t1, t2), with the centroids C_SPAM and C_NOT SPAM marked]


Steps: Centroid-based Method

2. CLASSIFICATION

Given a new document dn, calculate its document vector representation (as in the training stage).

[Figure: the new document dn plotted in the term space with axes t1 and t2]


Steps: Centroid-based Method

2. CLASSIFICATION

Measure the similarity between the vector dn and the centroids of the categories SPAM / NOT SPAM.

[Figure: the new document dn in the term space (t1, t2), together with the centroids C_SPAM and C_NOT SPAM]



Steps: Centroid-based Method

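3. FINAL RESULT

Obtain the maximum similarity between the document and the centroids of SPAM and NOT SPAM, and assign the document to the category that attains it:

\[ \arg\max_{i} \cos(d_n, C_i) \quad \text{for } i = 1, 2, \text{ where } 1 = \text{SPAM and } 2 = \text{NOT SPAM} \]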
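The three steps can be sketched end to end in a few lines. The two-dimensional vectors below are made-up stand-ins for TF-IDF document vectors over the terms t1 and t2, and the function names are introduced for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise average of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train(labeled_vectors):
    """Step 1: compute one centroid per category."""
    by_label = {}
    for vector, label in labeled_vectors:
        by_label.setdefault(label, []).append(vector)
    return {label: centroid(vs) for label, vs in by_label.items()}


def classify(centroids, vector):
    """Steps 2 and 3: pick the category whose centroid is most similar."""
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# Made-up training vectors in the (t1, t2) term space.
TRAINING = [
    ([0.9, 0.1], "SPAM"),
    ([0.8, 0.3], "SPAM"),
    ([0.2, 0.9], "NOT SPAM"),
    ([0.1, 0.7], "NOT SPAM"),
]

centroids = train(TRAINING)
print(centroids)
print(classify(centroids, [0.7, 0.2]))
print(classify(centroids, [0.1, 0.8]))
```

Training touches every document once, and classification needs only one similarity computation per category, which is why the method is cheap compared with nearest-neighbor search over all training documents.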


Analysis of Results

The standard measures of performance for text classification methods are precision and recall:

\[ P = \frac{\text{number of correctly predicted positives}}{\text{number of predicted positive examples}} \]

\[ R = \frac{\text{number of correctly predicted positives}}{\text{number of all positive examples}} \]


Analysis of Results

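Neither precision nor recall gives a good measure by itself. To get an idea of the performance, we have to combine them:

\[ F = \frac{2PR}{P + R} \]

[Figure: precision (P) plotted against recall (R), with the direction of better performance marked]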
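The three measures are easy to compute from a list of predictions. The labels and the six predictions below are made up for illustration, and "spam" is assumed to be the positive class.

```python
def precision_recall_f(predicted, actual, positive="spam"):
    """Return (precision, recall, F) for the given positive class."""
    true_pos = sum(
        1 for p, a in zip(predicted, actual) if p == positive and a == positive
    )
    predicted_pos = sum(1 for p in predicted if p == positive)
    actual_pos = sum(1 for a in actual if a == positive)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


# Made-up example: six messages, "spam" is the positive class.
PREDICTED = ["spam", "spam", "spam", "spam", "not spam", "not spam"]
ACTUAL = ["spam", "spam", "not spam", "not spam", "spam", "not spam"]

p, r, f = precision_recall_f(PREDICTED, ACTUAL)
print(f"P = {p:.3f}, R = {r:.3f}, F = {f:.3f}")
```

Here two of the four messages flagged as spam really are spam (P = 0.5), and two of the three spam messages were found (R = 2/3), which combine to F = 4/7.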


Some results

Compared against kNN and Naive Bayes, the centroid method performs better.


Content Based Approach

Spam can be detected before reading the message (non-content based):
  Based on a special protocol [3]: VoIP protocol
  Based on the address book [1]: build an email network
  Based on the IP address [4]
  ...

Or after processing the content of the email (content based)


Content-based Approach

Non-content based approach:
  removes spam messages that contain viruses or worms before they are read
  leaves some messages unlabeled

Content based method:
  widely used
  may need many pre-labeled messages
  labels a message based on its content
  Zdziarski [5] said that it's possible to stop spam, and that content-based filters are the way to do it

The rest of this presentation focuses on content based methods.


Content-based methods

Bayesian based method [6]
Centroid-based method [7]
Machine learning methods [8]
  Latent Semantic Indexing (LSI)
  Contextual Network Graphs (CNG)
Rule based method [9]
  ripper rule: a list of predefined rules that can be changed by hand
Memory based method [10]
  saving cost


Measurement

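Accuracy: the percentage of messages classified correctly, correct / (correct + incorrect)

False positive: a legitimate message that is misclassified as spam

Goals:
  Improve accuracy
  Prevent false positives

[Table: spam and non-spam messages, each split into correctly and incorrectly classified, with the false positives marked among the incorrect ones]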


Measurement

[Figure: the entire document collection divided by two criteria, relevant vs. irrelevant and retrieved vs. not retrieved, giving four regions: retrieved & relevant, not retrieved but relevant, retrieved & irrelevant, not retrieved & irrelevant]

\[ \text{recall} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of relevant documents}} \]

\[ \text{precision} = \frac{\text{Number of relevant documents retrieved}}{\text{Total number of documents retrieved}} \]


Rule-based method

A list of predefined rules that can be changed by hand (ripper rule)

Each rule/test is associated with a score
If an email fails a rule, its score is increased
After all rules are applied, if the score is above a certain threshold, the email is classified as spam
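The scoring scheme can be sketched as follows. The three rules, their scores and the threshold are invented for illustration and are not taken from any real rule set; messages are represented as plain dictionaries to keep the example short.

```python
# Each rule: (description, score, test). All values here are hypothetical.
RULES = [
    ("subject is all upper case", 1.5, lambda m: m["subject"].isupper()),
    ("body says 'click here'", 2.0, lambda m: "click here" in m["body"].lower()),
    ("more than three pictures", 1.0, lambda m: m["pictures"] > 3),
]
THRESHOLD = 3.0


def score(message):
    """Add up the scores of all rules that the message triggers."""
    return sum(points for _, points, test in RULES if test(message))


def is_spam(message):
    """Classify as spam when the total score is above the threshold."""
    return score(message) > THRESHOLD


OFFER = {
    "subject": "NEW MORTGAGE RATES",
    "body": "To find more about this, just click here.",
    "pictures": 0,
}
NOTE = {"subject": "Meeting notes", "body": "See the attachment.", "pictures": 1}

print(score(OFFER), is_spam(OFFER))
print(score(NOTE), is_spam(NOTE))
```

Changing the filter's behavior means editing `RULES` or `THRESHOLD` by hand, which is exactly the maintenance cost this approach carries.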


Rule-based method

Advantages:
  able to employ diverse and specific rules to check for spam
    check the size of the email
    check the number of pictures it contains
  no training messages are needed

Disadvantage:
  rules have to be entered and maintained by hand; they cannot be updated automatically


Latent Semantic indexing

Keyword: an important word for text classification
  a high-frequency word in a message
  can be used as an indicator for the message

Why LSI?
  Polysemy: a word can be used in more than one category (example: "play")
  Synonymy: two words have identical meaning

Based on a nearest-neighbors algorithm


Latent semantic indexing

Considers semantic links between words
  searches keywords over the semantic space
  two words that have the same meaning are treated as one word, which eliminates synonymy

Considers the overlap between different messages; this overlap may indicate:
  polysemy or a stop word
  two messages in the same category


Latent semantic indexing

Step 1: build a term-document matrix X from the input documents

Doc1: computer science department
Doc2: computer science and engineering science
Doc3: engineering school

       computer  science  department  and  engineering  school
Doc1   1         1        1           0    0            0
Doc2   1         2        0           1    1            0
Doc3   0         0        0           0    1            1


Latent semantic indexing

Step 2: Singular Value Decomposition (SVD) is performed on matrix X
  to extract a set of linearly independent factors that describe the matrix
  to generalize terms that have the same meaning
  three new matrices T, S and D are produced to reduce the vocabulary's size
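In the usual notation the decomposition and its truncation can be written as below. This is a sketch under stated assumptions: X is taken to have t rows (terms) and d columns (documents), r is its rank, and the symbols r, k and X_k are introduced here and do not appear on the slides.

```latex
X = T\,S\,D^{\mathsf{T}}, \qquad
T \in \mathbb{R}^{t \times r},\quad
S = \operatorname{diag}(\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0),\quad
D \in \mathbb{R}^{d \times r}

X_k = T_k\,S_k\,D_k^{\mathsf{T}}, \qquad k < r
```

Keeping only the k largest singular values, together with the matching columns of T and D, gives the best rank-k approximation of X in the least-squares sense. Each document is then described by k factors instead of t term counts, which is the sense in which the vocabulary size is reduced.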


Latent Semantic indexing

Two documents can be compared by finding the distance between their document vectors, which are stored in matrix X1.

Text classification is done by finding the nearest neighbors: the message is assigned to the category with the most documents among them.
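The nearest-neighbor vote can be sketched independently of the SVD step. The two-dimensional vectors below are made-up stand-ins for document vectors in the reduced space, cosine similarity is assumed as the closeness measure, and k = 3 is an arbitrary choice.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(training, vector, k=3):
    """Majority label among the k training vectors most similar to vector."""
    ranked = sorted(training, key=lambda item: cosine(vector, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up document vectors in a two-factor space.
TRAINING = [
    ([1.0, 0.0], "spam"),
    ([0.9, 0.2], "spam"),
    ([0.0, 1.0], "un-spam"),
    ([0.1, 0.9], "un-spam"),
    ([0.2, 0.8], "un-spam"),
]

print(knn_classify(TRAINING, [0.3, 0.7]))
print(knn_classify(TRAINING, [0.95, 0.1]))
```

Unlike the centroid method, every training document is kept and compared against at classification time.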


[Figure: example spam messages, un-spam messages and a test message]

The nearest-neighbors method classifies the test message as UN-SPAM.


Latent Semantic Indexing

Advantages:
  the entire training set can be learned at the same time
  no intermediate model needs to be built
  good when the training set is predefined

Disadvantages:
  when a new document is added, matrix X changes, and T, S and D need to be recalculated
  time consuming
  a real classifier needs the ability to change its training set


Contextual network Graphs

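A weighted, bipartite, undirected graph of term and document nodes

[Figure: bipartite graph linking the document nodes d1 and d2 to the term nodes t1, t2 and t3, with edge weights w11, w12, w13 (from d1) and w21, w22, w23 (from d2)]

At any time, for each node, the sum of the weights is 1:

t1: w11 + w21 = 1
d1: w11 + w12 + w13 = 1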


Contextual network graphs

When a new document d is added, the weights at node d are energized, and the weights at the connected nodes may need to be re-balanced.

The document is assigned to the class with the maximum average energy (weight).


Comparison of Bayesian, LSI, CNG, centroid and rule-based methods

                        Bayesian                   LSI                             CNG                         Centroid                   Rule
Classification          Statistical/probabilities  Generalization/contextual data  Energy re-balancing         TF-IDF, cosine similarity  Predefined rules
Learning                Update statistics          Recalculation of matrices       Addition of nodes to graph  Recalculation of centroid  Test against rule
Semantic                No                         Yes                             Yes                         No                         No
Automatic update model  Yes                        Yes                             Yes                         Yes                        No


Result


Result and conclusion:

LSI and CNG surpass the Bayesian approach by 5% in accuracy, and reduce false positives and negatives by up to 71%.

LSI and CNG show better performance even with a small document set.

Page 66: Presented by Christian Loza, Srikanth Palla and Liqin Zhang SPAM Christian Loza Srikanth Palla Liqin Zhang

Comparison of content-based and non-content-based methods

Non-content based:

Disadvantages: depends on special factors such as the email address, IP address, or a special protocol; leaves some messages unclassified

Advantage: detects spam before the message is read, with high accuracy

Page 67: Presented by Christian Loza, Srikanth Palla and Liqin Zhang SPAM Christian Loza Srikanth Palla Liqin Zhang

Content based:

Disadvantages: needs some training messages; classification is not 100% correct, because spammers also know the anti-spam techniques

Advantage: leaves no message unclassified

Page 68: Presented by Christian Loza, Srikanth Palla and Liqin Zhang SPAM Christian Loza Srikanth Palla Liqin Zhang

Improvements for spam filtering

Combine both methods: [1] proposes an email-network-based algorithm that reaches 100% accuracy but leaves 47% of messages unclassified; combining it with a content-based method can improve performance.

Build up multiple layers [11]

[11] Chris Miller, A Layered Approach to Enterprise Antispam
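One way to picture the combination of both methods is a simple cascade: the non-content filter decides when it can, and the content-based filter handles whatever is left unclassified. The sketch below is only an illustration. The two toy filters, the address sets, and the message layout are hypothetical stand-ins, not the algorithms from [1] or [11].

```python
from typing import Callable, Dict, Optional

Message = Dict[str, str]  # hypothetical layout: {"sender": ..., "body": ...}

def combined_classifier(
    network_filter: Callable[[Message], Optional[str]],
    content_filter: Callable[[Message], str],
) -> Callable[[Message], str]:
    """Build a classifier that falls back to content analysis when needed."""
    def classify(msg: Message) -> str:
        verdict = network_filter(msg)   # may return None ("unclassified")
        if verdict is not None:
            return verdict
        return content_filter(msg)      # always returns a label
    return classify

# Toy stand-ins for the two layers (assumptions for illustration only).
KNOWN_GOOD = {"alice@example.org"}
KNOWN_BAD = {"bulk@example.net"}

def toy_network_filter(msg: Message) -> Optional[str]:
    if msg["sender"] in KNOWN_GOOD:
        return "ham"
    if msg["sender"] in KNOWN_BAD:
        return "spam"
    return None

def toy_content_filter(msg: Message) -> str:
    return "spam" if "cheap pills" in msg["body"].lower() else "ham"

classify = combined_classifier(toy_network_filter, toy_content_filter)
```

With this ordering the high-accuracy layer keeps its verdicts, and no message is left unclassified.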

Page 69: Presented by Christian Loza, Srikanth Palla and Liqin Zhang SPAM Christian Loza Srikanth Palla Liqin Zhang

Data set for spam:

Non-content based:

Email network: one author’s email corpus, formed by 5,486 messages

IP address: none

Content based:

Page 70: Presented by Christian Loza, Srikanth Palla and Liqin Zhang SPAM Christian Loza Srikanth Palla Liqin Zhang

Data set for spam

LSI & CNG: corpus of varying size (250 to 4000), with spam and non-spam emails in equal amounts

Bayesian based: corpus of 1789 emails (211 spam, 1578 non-spam)

Centroid based: 200 email messages in total (90 spam, 110 non-spam)

Page 71: Presented by Christian Loza, Srikanth Palla and Liqin Zhang SPAM Christian Loza Srikanth Palla Liqin Zhang

Most recently used benchmarks:

Reuters: about 7700 training and 3000 test documents, 30000 terms, 135 categories, 21 MB. Each category has about 57 instances. A collection of newswire stories

20NG: about 18800 documents in total, 94000 terms, 20 topics, 25 MB. Each category has about 1000 instances

WebKB: about 8300 documents, 7 categories, 26 MB. Each category has about 1200 instances. Website data from 4 universities

The three benchmarks above are well known in recent IR work; they are small in size and are used to test performance and CPU scalability
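The per-category figures quoted for these three benchmarks appear to be the number of documents divided by the number of categories. That reading is an inference from the numbers, not something the slide states, and for Reuters the quoted 57 matches only when the training documents alone are counted. A quick check:

```python
# (documents, categories) as quoted on the slide; Reuters uses training documents only.
benchmarks = {
    "Reuters": (7700, 135),
    "20NG": (18800, 20),
    "WebKB": (8300, 7),
}

per_category = {name: docs / cats for name, (docs, cats) in benchmarks.items()}

for name, value in per_category.items():
    print(f"{name}: about {value:.0f} documents per category")
```

The results (roughly 57, 940 and 1186) line up with the rounded values of 57, 1000 and 1200 given above.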

Page 72: Presented by Christian Loza, Srikanth Palla and Liqin Zhang SPAM Christian Loza Srikanth Palla Liqin Zhang

Benchmarks

OHSUMED: 348566 documents, 230000 terms, and 308511 topics, 400 MB. Each category has about 1 instance. Abstracts from medical journals

Dmoz: 482 topics, 300 training documents for each topic, 271 MB. Each category has less than 1 instance. Taken from the Dmoz (http://dmoz.org/) topic tree

Large datasets, used to test the memory scalability of a model

Page 73: Presented by Christian Loza, Srikanth Palla and Liqin Zhang SPAM Christian Loza Srikanth Palla Liqin Zhang

Some facts

Spam is a growing problem, and research on this topic has become more relevant in recent years

Spam grows because it works.

Many commercial products try to fight spam.

Most of them rely on the techniques presented here, or on a combination of them

Spam damages the economy more than hackers or viruses

Page 74: Presented by Christian Loza, Srikanth Palla and Liqin Zhang SPAM Christian Loza Srikanth Palla Liqin Zhang

Some facts

Damages attributed to spam are estimated at around 10.4 billion in 2003 and 58 to 112 billion in 2004, and are projected to exceed 200 billion worldwide in 2005.

1.6 trillion unsolicited messages were sent in 2004.

Page 75: Presented by Christian Loza, Srikanth Palla and Liqin Zhang SPAM Christian Loza Srikanth Palla Liqin Zhang

Conclusions

Spam is a problem that has a great impact on global business

We presented three methods for spam classification.

The benchmarks on these three methods suggest that a combination of the methods performs better than any method alone

Page 76: Presented by Christian Loza, Srikanth Palla and Liqin Zhang SPAM Christian Loza Srikanth Palla Liqin Zhang

Conclusions

Spam classifiers can be content based or non-content based

Content based: rules, Naïve Bayes, centroid

Non-content-based methods work without reading the content of the mail

Page 77: Presented by Christian Loza, Srikanth Palla and Liqin Zhang SPAM Christian Loza Srikanth Palla Liqin Zhang

Conclusions

Researchers have found ways to increase the accuracy of all the methods, using heuristics and combining them

Spammers also learn how to avoid spam filters

No single method is perfect in all situations

Page 78: Presented by Christian Loza, Srikanth Palla and Liqin Zhang SPAM Christian Loza Srikanth Palla Liqin Zhang

Sources

Slide 1, image: http://www.ecommerce-guide.com

Slide 1, image: http://www.email-firewall.jp/products/das.html

Page 79: Presented by Christian Loza, Srikanth Palla and Liqin Zhang SPAM Christian Loza Srikanth Palla Liqin Zhang

References

Anti-Spam Filtering: A Centroid-Based Classification Approach, Nuanwan Soonthornphisaj, Kanokwan Chaikulseriwat, Piyan Tang-On, 2002

Centroid-Based Document Classification: Analysis & Experimental Results, Eui-Hong (Sam) Han and George Karypis, 2000

Multi-dimensional Text Classification, Thanaruk Theeramunkong, 2002

Improving Centroid-Based Text Classification Using Term-Distribution-Based Weighting System and Clustering, Thanaruk Theeramunkong and Verayuth Lertnattee

Combining Homogeneous Classifiers for Centroid-Based Text Classification, Verayuth Lertnattee and Thanaruk Theeramunkong

Page 80: Presented by Christian Loza, Srikanth Palla and Liqin Zhang SPAM Christian Loza Srikanth Palla Liqin Zhang

References

[1] P. Oscar Boykin and Vwani Roychowdhury, Personal Email Networks: An Effective Anti-Spam Tool, IEEE Computer, volume 38, 2004

[2] Andras A. Benczur, Karoly Csalogany, Tamas Sarlos and Mate Uher, SpamRank - Fully Automatic Link Spam Detection, citeseer.ist.psu.edu/benczur05spamrank.html

[3] R. Dantu, P. Kolan, “Detecting Spam in VoIP Networks”, Proceedings of USENIX, SRUTI (Steps for Reducing Unwanted Traffic on the Internet) workshop, July 2005 (accepted)

[4] IP addresses in email clients, http://www.ceas.cc/papers-2004/162.pdf

[5] A Plan for Spam, http://www.paulgraham.com/spam.html

Page 81: Presented by Christian Loza, Srikanth Palla and Liqin Zhang SPAM Christian Loza Srikanth Palla Liqin Zhang

References

[6] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. 1998, “A Bayesian Approach to Filtering Junk E-Mail”, Learning for Text Categorization – Papers from the AAAI Workshop, pages 55–62, Madison Wisconsin. AAAI Technical Report WS-98-05

[7] N. Soonthornphisaj, K. Chaikulseriwat, P. Tang-On, “Anti-Spam Filtering: A Centroid Based Classification Approach”, IEEE proceedings ICSP 02

[8] Spam Filtering Using Contextual Networking Graphs www.cs.tcd.ie/courses/csll/dkellehe0304.pdf

[9] W.W. Cohen, “Learning Rules that Classify e-mail”, In Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, 1996

[10] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos, P. Stamatopoulos, “A memory based approach to anti-spam filtering for mailing lists”, Information Retrieval 2003