Email Spam Detection Using Machine Learning


Lydia Song, Lauren Steimle, Xiaoxiao Xu

Outline
Introduction to Project
Pre-processing
Dimensionality Reduction
Brief discussion of different algorithms: k-nearest neighbors, decision tree, logistic regression, Naïve Bayes
Preliminary results
Conclusion

Spam Statistics
The percentage of spam in email traffic averaged 69.9% in February 2014.
Source: https://www.securelist.com/en/analysis/204792328/Spam_report_February_2014

[Chart: percentage of spam in email traffic]

Spam vs. Ham
Spam = unwanted communication
Ham = normal communication

Pre-processing

Example of a Spam Email and the Corresponding File in the Data Set

Pre-processing
1. Remove meaningless words
2. Create a "bag of words" used in the data set
3. Combine similar words
4. Create a feature matrix

[Diagram: feature matrix with one row per email (Email 1 ... Email m) and one column per bag-of-words entry ("history", "last", "service")]

Pre-processing Example
Your history shows that your last order is ready for refilling.
Thank you,
Sam Mcfarland
Customer Services

tokens = ['your', 'history', 'shows', 'that', 'your', 'last', 'order', 'is', 'ready', 'for', 'refilling', 'thank', 'you', 'sam', 'mcfarland', 'customer', 'services']

filtered_words = ['history', 'last', 'order', 'ready', 'refilling', 'thank', 'sam', 'mcfarland', 'customer', 'services']

bag of words = ['history', 'last', 'order', 'ready', 'refill', 'thank', 'sam', 'mcfarland', 'custom', 'service']
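To make the four pre-processing steps concrete, here is a minimal Python sketch. The stop-word list and the suffix-stripping "stemmer" are toy stand-ins (the deck does not say which tokenizer, stop-word list, or stemmer was actually used), so the output only approximates the example above.

```python
import re

# Toy stop-word list and suffix stripper; both are illustrative stand-ins for
# whatever stop-word list and stemmer the authors actually used.
STOP_WORDS = {"your", "that", "is", "for", "you", "shows"}

def crude_stem(word):
    """Very rough stemming: strip a few common suffixes, e.g. 'refilling' -> 'refill'."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())           # tokenize
    filtered = [t for t in tokens if t not in STOP_WORDS]  # remove meaningless words
    return [crude_stem(t) for t in filtered]               # combine similar words

emails = [
    "Your history shows that your last order is ready for refilling. "
    "Thank you, Sam Mcfarland, Customer Services",
]
stemmed = [preprocess(e) for e in emails]
vocab = sorted({w for words in stemmed for w in words})          # bag of words across all emails
matrix = [[words.count(w) for w in vocab] for words in stemmed]  # feature matrix of word counts
print(vocab)
print(matrix)
```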

[Diagram: feature matrix after stemming, with one row per email (Email 1 ... Email m) and one column per stemmed word ("histori", "last", "servi")]

Dimensionality Growth
Roughly 100-150 features are added for each additional email.

[Chart: Growth of Number of Features (0-50,000) vs. Number of Emails Considered (50-300)]

Dimensionality Reduction
Add a requirement that a word must appear in at least x% of all emails to be considered a feature (see the sketch below).

[Chart: Growth of Features with Cutoff Requirement, for cutoffs of 5%, 10%, 15%, and 20%; Number of Features (0-600) vs. Number of Emails Considered (50-300)]
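As a sketch of this cutoff, scikit-learn's CountVectorizer exposes the same idea as its min_df parameter. This is only an illustration, not the deck's own pipeline: the tiny corpus and the 50% cutoff below are made up, while the project itself later uses a 15% cutoff (min_df=0.15).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tiny corpus standing in for the email bodies.
emails = [
    "your last order is ready for refilling",
    "meeting moved to friday please confirm",
    "claim your free prize order now",
    "lunch order for friday",
]

# min_df=0.5 keeps only words appearing in at least 50% of the emails;
# the deck's 15% cutoff would correspond to min_df=0.15 on the real data.
vectorizer = CountVectorizer(min_df=0.5)
X = vectorizer.fit_transform(emails)
print(vectorizer.get_feature_names_out())  # surviving features
print(X.shape)                             # (number of emails, number of features)
```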

Dimensionality Reduction: Hashing Trick
Before hashing: 70 x 9403 dimensions. After hashing: 70 x 1024 dimensions.
[Diagram: each string is hashed to an integer that serves as a hash table index]

Source: Jorge Stolfi, http://en.wikipedia.org/wiki/File:Hash_table_5_0_1_1_1_1_1_LL.svg#filelinks
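A minimal Python sketch of the string-to-integer-to-index mapping. The 1024 buckets follow the "after hashing" dimension on the slide, but the md5 hash is just one stable choice of hash function and this is not the authors' implementation.

```python
import hashlib

NUM_BUCKETS = 1024  # matches the 70 x 1024 "after hashing" dimension on the slide

def bucket(word):
    """Map a string to a stable integer, then to a hash-table index in [0, NUM_BUCKETS)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def hash_features(tokens):
    """Build one fixed-length feature row; colliding words simply share a column."""
    row = [0] * NUM_BUCKETS
    for t in tokens:
        row[bucket(t)] += 1
    return row

print(bucket("history"), bucket("service"))
row = hash_features(["your", "history", "shows", "your", "last", "order"])
print(len(row), sum(row))   # 1024 columns, 6 token occurrences counted
```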

Outline
Introduction to Project
Pre-processing
Dimensionality Reduction
Brief discussion of different algorithms: k-nearest neighbors, decision tree, logistic regression, Naïve Bayes
Preliminary results
Conclusion

K-Nearest Neighbors
Goal: classify an unknown sample into one of C classes.
Idea: to determine the label of an unknown sample x, look at x's k nearest neighbors in the training set and take a majority vote among their labels.
Image from MIT OpenCourseWare
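A small scikit-learn sketch of the idea; the feature rows, labels, and k = 3 are made-up values, and scikit-learn is assumed rather than the project's actual toolchain.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical word-count feature rows for six training emails (1 = spam, 0 = ham).
X_train = np.array([[3, 0, 1], [0, 2, 0], [4, 1, 2], [0, 3, 1], [2, 0, 2], [1, 3, 0]])
y_train = np.array([1, 0, 1, 0, 1, 0])

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3 nearest neighbors
knn.fit(X_train, y_train)

# The unknown email is labeled by majority vote among its 3 closest training emails.
print(knn.predict(np.array([[2, 1, 1]])))
```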

Decision Tree
Convert the training data into a tree structure.
Root node: the first decision node.
Decision node: an if-then decision based on features of the training sample.
Leaf node: contains a class label.
Image from MIT OpenCourseWare
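A minimal scikit-learn sketch with hypothetical two-feature data (again, scikit-learn is an assumption, not the deck's toolchain); it prints the learned root/decision/leaf structure and classifies one new email.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [count of "free", count of "order"] per training email.
X_train = np.array([[3, 0], [0, 2], [2, 1], [0, 1], [4, 0], [1, 3]])
y_train = np.array([1, 0, 1, 0, 1, 0])      # 1 = spam, 0 = ham

tree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
print(export_text(tree, feature_names=["free", "order"]))  # root/decision/leaf structure
print(tree.predict(np.array([[2, 0]])))                    # follow the if-then path to a leaf
```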

Logistic Regression
"Regression" over the training examples.
Transform the continuous output y into a prediction of 1 or 0 using the standard logistic function σ(z) = 1 / (1 + e^(-z)).
Predict spam if the logistic output (the predicted probability of spam) is at least 0.5.
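A minimal numeric sketch of that decision rule; the weight vector theta and the feature vector x below are invented values, not weights learned from the project's data.

```python
import numpy as np

def sigmoid(z):
    """Standard logistic function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights and one email's features
# (a leading 1 for the intercept, then word counts).
theta = np.array([-1.0, 0.8, 1.5])
x = np.array([1.0, 2.0, 1.0])

p_spam = sigmoid(theta @ x)                          # predicted probability of spam
print(p_spam, "spam" if p_spam >= 0.5 else "ham")    # predict spam at probability >= 0.5
```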

Naïve Bayes
Use Bayes' theorem, P(H | e) = P(e | H) P(H) / P(e), with hypothesis H (spam or not spam) and event e (a word occurs).
For example, the probability that an email is spam given that the word "free" appears in the email.
"Naïve": assume the feature values are independent of each other.
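A worked numeric sketch of the "free" example using Bayes' theorem; the counts are invented for illustration and are not taken from the project's data set.

```python
# Invented training-set counts, for illustration only.
n_spam, n_ham = 150, 150                  # emails of each class
n_free_in_spam, n_free_in_ham = 60, 6     # emails containing the word "free"

p_spam = n_spam / (n_spam + n_ham)                             # P(spam)
p_free_given_spam = n_free_in_spam / n_spam                    # P("free" | spam)
p_free = (n_free_in_spam + n_free_in_ham) / (n_spam + n_ham)   # P("free")

# Bayes' theorem: P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 2))   # ~0.91 with these made-up counts
```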

Outline
Introduction to Project
Pre-processing
Dimensionality Reduction
Brief discussion of different algorithms: k-nearest neighbors, decision tree, logistic regression, Naïve Bayes
Preliminary results
Conclusion

Preliminary Results
250 emails in the training set, 50 in the testing set.
Use 15% as the "percentage of emails" cutoff.
Performance measures:
Accuracy: % of predictions that were correct.
Recall: % of spam emails that were predicted correctly.
Precision: % of emails classified as spam that were actually spam.
F-score: a weighted average (harmonic mean) of precision and recall.
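For concreteness, the four measures can be computed with scikit-learn as below; the 10-email test labels are hypothetical and do not reflect the project's actual results.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical test labels: 1 = spam, 0 = ham.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))    # % of predictions that were correct
print("recall   :", recall_score(y_true, y_pred))      # % of spam that was caught
print("precision:", precision_score(y_true, y_pred))   # % flagged as spam that really was spam
print("f1 score :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall
```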

"Percentage of Emails" Performance
[Chart: performance of linear regression and logistic regression across "percentage of emails" cutoff values]

Preliminary Results
[Chart: preliminary results]

Next Steps
Implement SVM: Matlab vs. Weka
Hashing trick: try different numbers of buckets
Regularization

Thank you! Any questions?
