Email Spam Detection Using Machine Learning


Lydia Song, Lauren Steimle, Xiaoxiao Xu

Outline
Introduction to Project
Pre-processing
Dimensionality Reduction
Brief discussion of different algorithms: k-nearest neighbors, decision tree, logistic regression, Naïve Bayes
Preliminary results
Conclusion

Spam Statistics
The percentage of spam in email traffic averaged 69.9% in February 2014.
Source: https://www.securelist.com/en/analysis/204792328/Spam_report_February_2014

[Chart: percentage of spam in email traffic]

Spam vs. Ham
Spam = unwanted communication
Ham = normal communication

Pre-processing

Example of a Spam Email and the Corresponding File in the Data Set

Pre-processing
1. Remove meaningless words
2. Create a "bag of words" used in the data set
3. Combine similar words
4. Create a feature matrix

[Diagram: feature matrix with one row per email (Email 1 ... Email m) and one column per bag-of-words entry ("history", "last", "service")]

Pre-processing Example
Your history shows that your last order is ready for refilling.
Thank you,
Sam Mcfarland
Customer Services

tokens = ['your', 'history', 'shows', 'that', 'your', 'last', 'order', 'is', 'ready', 'for', 'refilling', 'thank', 'you', 'sam', 'mcfarland', 'customer', 'services']

filtered_words = ['history', 'last', 'order', 'ready', 'refilling', 'thank', 'sam', 'mcfarland', 'customer', 'services']

bag of words = ['history', 'last', 'order', 'ready', 'refill', 'thank', 'sam', 'mcfarland', 'custom', 'service']
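To make the four pre-processing steps concrete, here is a minimal Python sketch. The stop-word list and the suffix-stripping "stemmer" are toy stand-ins (the deck does not say which tokenizer, stop-word list, or stemmer was actually used), so the output only approximates the example above.

```python
import re

# Toy stop-word list and suffix stripper; both are illustrative stand-ins for
# whatever stop-word list and stemmer the authors actually used.
STOP_WORDS = {"your", "that", "is", "for", "you", "shows"}

def crude_stem(word):
    """Very rough stemming: strip a few common suffixes, e.g. 'refilling' -> 'refill'."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())           # tokenize
    filtered = [t for t in tokens if t not in STOP_WORDS]  # remove meaningless words
    return [crude_stem(t) for t in filtered]               # combine similar words

emails = [
    "Your history shows that your last order is ready for refilling. "
    "Thank you, Sam Mcfarland, Customer Services",
]
stemmed = [preprocess(e) for e in emails]
vocab = sorted({w for words in stemmed for w in words})          # bag of words across all emails
matrix = [[words.count(w) for w in vocab] for words in stemmed]  # feature matrix of word counts
print(vocab)
print(matrix)
```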

[Diagram: feature matrix after stemming, with one row per email (Email 1 ... Email m) and one column per stemmed word ("histori", "last", "servi")]

Dimensionality Growth
Roughly 100-150 features are added for each additional email.

[Chart: Growth of Number of Features (0-50,000) vs. Number of Emails Considered (50-300)]

Dimensionality Reduction
Add a requirement that a word must appear in at least x% of all emails to be considered a feature (see the sketch below).

[Chart: Growth of Features with Cutoff Requirement, for cutoffs of 5%, 10%, 15%, and 20%; Number of Features (0-600) vs. Number of Emails Considered (50-300)]
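As a sketch of this cutoff, scikit-learn's CountVectorizer exposes the same idea as its min_df parameter. This is only an illustration, not the deck's own pipeline: the tiny corpus and the 50% cutoff below are made up, while the project itself later uses a 15% cutoff (min_df=0.15).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tiny corpus standing in for the email bodies.
emails = [
    "your last order is ready for refilling",
    "meeting moved to friday please confirm",
    "claim your free prize order now",
    "lunch order for friday",
]

# min_df=0.5 keeps only words appearing in at least 50% of the emails;
# the deck's 15% cutoff would correspond to min_df=0.15 on the real data.
vectorizer = CountVectorizer(min_df=0.5)
X = vectorizer.fit_transform(emails)
print(vectorizer.get_feature_names_out())  # surviving features
print(X.shape)                             # (number of emails, number of features)
```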

Dimensionality Reduction: Hashing Trick
Before hashing: 70 x 9403 dimensions. After hashing: 70 x 1024 dimensions.
[Diagram: each string is hashed to an integer that serves as a hash table index]

Source: Jorge Stolfi, http://en.wikipedia.org/wiki/File:Hash_table_5_0_1_1_1_1_1_LL.svg#filelinks
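A minimal Python sketch of the string-to-integer-to-index mapping. The 1024 buckets follow the "after hashing" dimension on the slide, but the md5 hash is just one stable choice of hash function and this is not the authors' implementation.

```python
import hashlib

NUM_BUCKETS = 1024  # matches the 70 x 1024 "after hashing" dimension on the slide

def bucket(word):
    """Map a string to a stable integer, then to a hash-table index in [0, NUM_BUCKETS)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def hash_features(tokens):
    """Build one fixed-length feature row; colliding words simply share a column."""
    row = [0] * NUM_BUCKETS
    for t in tokens:
        row[bucket(t)] += 1
    return row

print(bucket("history"), bucket("service"))
row = hash_features(["your", "history", "shows", "your", "last", "order"])
print(len(row), sum(row))   # 1024 columns, 6 token occurrences counted
```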

Outline
Introduction to Project
Pre-processing
Dimensionality Reduction
Brief discussion of different algorithms: k-nearest neighbors, decision tree, logistic regression, Naïve Bayes
Preliminary results
Conclusion

K-Nearest Neighbors
Goal: classify an unknown sample into one of C classes.
Idea: to determine the label of an unknown sample x, look at x's k nearest neighbors in the training set and take a majority vote among their labels.
Image from MIT OpenCourseWare
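A small scikit-learn sketch of the idea; the feature rows, labels, and k = 3 are made-up values, and scikit-learn is assumed rather than the project's actual toolchain.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical word-count feature rows for six training emails (1 = spam, 0 = ham).
X_train = np.array([[3, 0, 1], [0, 2, 0], [4, 1, 2], [0, 3, 1], [2, 0, 2], [1, 3, 0]])
y_train = np.array([1, 0, 1, 0, 1, 0])

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3 nearest neighbors
knn.fit(X_train, y_train)

# The unknown email is labeled by majority vote among its 3 closest training emails.
print(knn.predict(np.array([[2, 1, 1]])))
```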

Decision Tree
Convert the training data into a tree structure.
Root node: the first decision node.
Decision node: an if-then decision based on features of the training sample.
Leaf node: contains a class label.
Image from MIT OpenCourseWare
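A minimal scikit-learn sketch with hypothetical two-feature data (again, scikit-learn is an assumption, not the deck's toolchain); it prints the learned root/decision/leaf structure and classifies one new email.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [count of "free", count of "order"] per training email.
X_train = np.array([[3, 0], [0, 2], [2, 1], [0, 1], [4, 0], [1, 3]])
y_train = np.array([1, 0, 1, 0, 1, 0])      # 1 = spam, 0 = ham

tree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
print(export_text(tree, feature_names=["free", "order"]))  # root/decision/leaf structure
print(tree.predict(np.array([[2, 0]])))                    # follow the if-then path to a leaf
```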

Logistic Regression
"Regression" over the training examples.
Transform the continuous output y into a prediction of 1 or 0 using the standard logistic function σ(z) = 1 / (1 + e^(-z)).
Predict spam if the logistic output (the predicted probability of spam) is at least 0.5.
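A minimal numeric sketch of that decision rule; the weight vector theta and the feature vector x below are invented values, not weights learned from the project's data.

```python
import numpy as np

def sigmoid(z):
    """Standard logistic function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned weights and one email's features
# (a leading 1 for the intercept, then word counts).
theta = np.array([-1.0, 0.8, 1.5])
x = np.array([1.0, 2.0, 1.0])

p_spam = sigmoid(theta @ x)                          # predicted probability of spam
print(p_spam, "spam" if p_spam >= 0.5 else "ham")    # predict spam at probability >= 0.5
```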

Naïve Bayes
Use Bayes' theorem, P(H | e) = P(e | H) P(H) / P(e), with hypothesis H (spam or not spam) and event e (a word occurs).
For example, the probability that an email is spam given that the word "free" appears in the email.
"Naïve": assume the feature values are independent of each other.
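A worked numeric sketch of the "free" example using Bayes' theorem; the counts are invented for illustration and are not taken from the project's data set.

```python
# Invented training-set counts, for illustration only.
n_spam, n_ham = 150, 150                  # emails of each class
n_free_in_spam, n_free_in_ham = 60, 6     # emails containing the word "free"

p_spam = n_spam / (n_spam + n_ham)                             # P(spam)
p_free_given_spam = n_free_in_spam / n_spam                    # P("free" | spam)
p_free = (n_free_in_spam + n_free_in_ham) / (n_spam + n_ham)   # P("free")

# Bayes' theorem: P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 2))   # ~0.91 with these made-up counts
```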

Outline
Introduction to Project
Pre-processing
Dimensionality Reduction
Brief discussion of different algorithms: k-nearest neighbors, decision tree, logistic regression, Naïve Bayes
Preliminary results
Conclusion

Preliminary Results
250 emails in the training set, 50 in the testing set.
Use 15% as the "percentage of emails" cutoff.
Performance measures:
Accuracy: % of predictions that were correct.
Recall: % of spam emails that were predicted correctly.
Precision: % of emails classified as spam that were actually spam.
F-score: a weighted average (harmonic mean) of precision and recall.
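For concreteness, the four measures can be computed with scikit-learn as below; the 10-email test labels are hypothetical and do not reflect the project's actual results.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical test labels: 1 = spam, 0 = ham.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))    # % of predictions that were correct
print("recall   :", recall_score(y_true, y_pred))      # % of spam that was caught
print("precision:", precision_score(y_true, y_pred))   # % flagged as spam that really was spam
print("f1 score :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall
```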

"Percentage of Emails" Performance
[Chart: performance of linear regression and logistic regression across "percentage of emails" cutoff values]

Preliminary Results
[Chart: preliminary results]

Next Steps
Implement SVM: Matlab vs. Weka
Hashing trick: try different numbers of buckets
Regularization

Thank you! Any questions?
