Comment Spam Identification Eric Cheng & Eric Steinlauf

Page 1: Comment Spam Identification

Comment Spam Identification

Eric Cheng & Eric Steinlauf

Page 2: Comment Spam Identification

What is comment spam?

Page 3: Comment Spam Identification
Page 4: Comment Spam Identification
Page 5: Comment Spam Identification

Total spam: 1,226,026,178
Total ham: 62,723,306

95% are spam!
Source: http://akismet.com/stats/ Retrieved 4/22/2007

Page 6: Comment Spam Identification

Countermeasures

Page 7: Comment Spam Identification

Blacklisting

5yx.org
9kx.com
aakl.com
aaql.com
aazl.com
abcwaynet.com
abgv.com
abjg.com
ablazeglass.com
abseilextreme.net
actionbenevole.com
acvt.com
adbx.com
adhouseaz.com
advantechmicro.com
aeur.com
aeza.com
agentcom.com
ailh.org
akbu.com
alaskafibre.com
alkm.com
alqx.com
alumcasting-eng-inc.co!
americanasb.com
amwayau.com
amwaynz.com
amwaysa.com
amysudderjoy.com
anfb.com
anlusa.net
aobr.com
aoeb.com
apoctech.com
apqf.com
areagent.com
artstonehalloweencostumes.com

globalplasticscrap.com
gowest-veritas.com
greenlightgo.org
hadjimitsis.com
healthcarefx.com
herctrade.com
hobbyhighway.com
hominginc.com
hongkongdivas.com
hpspyacademy.com
hzlr.com
idlemindsonline.com
internetmarketingserve.com
jesh.org
jfcp.com
jfss.com
jittersjapan.com
jkjf.com
jkmrw.com
jknr.com
jksp.com
jkys.com
jtjk.com
justfareed.com
justyourbag.com
kimsanghee.org
kiosksusa.com
knivesnstuff.com
knoxvillevideo.com
ksj!
kwscuolashop.com
lancashiremcs.com
lnjk.com
localmediaaccess.com
lrgww.com
marketing-in-china.com

rockymountainair.org

rstechresources.com

samsung-integer.com

sandiegonhs.org

screwpile.org

scvend.org

sell-in-china.com

sensationalwraps.com

sevierdesign.com

starbikeshop.com

struthersinc.com

swarangeet.com

thecorporategroup.net

thehawleyco.com

thehumancrystal.com

thinkaids.org

thisandthatgiftshop.net

thomsungroup.com

ti0.org

timeby.net

tradewindswf.com

tradingb2c.com

turkeycogroup.net

vassagospalace.com

vyoung.net

web-toggery.com

webedgewars.com

webshoponsalead.com

webtoggery.com

willman-paris.com

worldwidegoans.com
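
A blacklist check of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the URL pattern, and the four-entry `BLACKLIST` (a handful of domains copied from the list above) are assumptions made for the example.

```python
import re
from urllib.parse import urlparse

# Hypothetical blacklist: a few domains copied from the slide, for illustration only.
BLACKLIST = {"5yx.org", "9kx.com", "sell-in-china.com", "webtoggery.com"}

# Rough pattern for http(s) URLs; stops at whitespace, quotes, and angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

def linked_domains(comment):
    """Return the lower-cased host names of all http(s) URLs in the comment."""
    hosts = set()
    for url in URL_PATTERN.findall(comment):
        host = urlparse(url).hostname
        if host:
            hosts.add(host.lower())
    return hosts

def is_blacklisted(comment, blacklist=BLACKLIST):
    """True if any linked host equals, or is a subdomain of, a listed domain."""
    for host in linked_domains(comment):
        parts = host.split(".")
        # Try "a.b.c", then "b.c", so subdomains of listed domains also match.
        for i in range(len(parts) - 1):
            if ".".join(parts[i:]) in blacklist:
                return True
    return False
```

The weakness is visible in the sketch: the list only helps after a domain has been seen and added, so a spammer who registers a new domain passes the check.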

Page 8: Comment Spam Identification

Captchas

• "Completely Automated Public Turing test to tell Computers and Humans Apart"

Page 9: Comment Spam Identification

Other ad-hoc/weak methods

• Authentication / registration
• Comment throttling
• Disallowing links in comments
• Moderation

Page 10: Comment Spam Identification

Our Approach – Naïve Bayes

• Statistical
• Adaptive
• Automatic
• Scalable and extensible
• Works well for spam e-mail

Page 11: Comment Spam Identification

Naïve Bayes

Page 12: Comment Spam Identification

P(A|B) ∙ P(B) = P(B|A) ∙ P(A) = P(AB)

Page 13: Comment Spam Identification

P(A|B) ∙ P(B) = P(B|A) ∙ P(A)

Page 14: Comment Spam Identification

P(A|B) = P(B|A) ∙ P(A) / P(B)

Page 15: Comment Spam Identification

P(spam|comment) = P(comment|spam) ∙ P(spam) / P(comment)
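
A quick numeric check of the rule, as a sketch. All three input probabilities below are made-up values chosen only for illustration, and the denominator is obtained from the law of total probability over the two classes.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Hypothetical numbers, for illustration only.
p_spam = 0.95                  # P(spam)
p_comment_given_spam = 0.002   # P(comment|spam)
p_comment_given_ham = 0.0001   # P(comment|ham)

# P(comment) = P(comment|spam) * P(spam) + P(comment|ham) * P(ham)
p_comment = p_comment_given_spam * p_spam + p_comment_given_ham * (1 - p_spam)

p_spam_given_comment = posterior(p_comment_given_spam, p_spam, p_comment)
p_ham_given_comment = posterior(p_comment_given_ham, 1 - p_spam, p_comment)
```

With these inputs the two posteriors sum to 1, and the spam posterior is a little above 0.997.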

Page 17: Comment Spam Identification

P(spam|comment) = P(w1|spam) ∙ P(w2|spam) ∙ … ∙ P(wn|spam) ∙ P(spam) / P(comment)

(naïve assumption: the words of a comment are treated as independent of one another given the class)

P(w1|spam): probability of w1 occurring given a spam comment

Page 18: Comment Spam Identification

P(w1|spam) = 1 – (1 – x/y)^n

Probability of w1 occurring given a spam comment, where x is the number of times w1 appears in all spam messages, y is the total number of words in all spam messages, and n is the length of the given comment.

Corpus: "Texas casino", "Online Texas hold’em"

Incoming comment: "Texas gambling site"

Here x = 2, y = 5, and n = 3, so

P(Texas|spam) = 1 – (1 – 2/5)^3 = 0.784
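
The Texas example can be reproduced with a short sketch. The function name and the whitespace, lower-casing tokenizer are assumptions made for illustration; the formula is the one from the slide. An empty corpus would divide by zero and is not handled here.

```python
def word_likelihood(word, corpus_messages, comment_length):
    """P(word | class) = 1 - (1 - x/y)^n, as defined on the slide."""
    # Assumption: tokens are lower-cased and separated by whitespace.
    tokens = [t for message in corpus_messages for t in message.lower().split()]
    x = tokens.count(word.lower())  # occurrences of the word in the class corpus
    y = len(tokens)                 # total number of words in the class corpus
    n = comment_length              # number of words in the incoming comment
    return 1 - (1 - x / y) ** n

spam_corpus = ["Texas casino", "Online Texas hold'em"]
incoming = "Texas gambling site"
p = word_likelihood("Texas", spam_corpus, len(incoming.split()))
print(round(p, 3))  # 0.784
```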

Page 22: Comment Spam Identification

P(spam|comment) = P(w1|spam) ∙ P(w2|spam) ∙ … ∙ P(wn|spam) ∙ P(spam) / P(comment)

P(ham|comment) = P(w1|ham) ∙ P(w2|ham) ∙ … ∙ P(wn|ham) ∙ P(ham) / P(comment)

P(w1|spam): probability of w1 occurring given a spam comment

P(spam): probability of something being spam

P(comment): ?????? (unknown, but the same in both equations)

Page 23: Comment Spam Identification

P(spam|comment) ∝ P(w1|spam) ∙ P(w2|spam) ∙ … ∙ P(wn|spam) ∙ P(spam)

P(ham|comment) ∝ P(w1|ham) ∙ P(w2|ham) ∙ … ∙ P(wn|ham) ∙ P(ham)

P(w1|spam): probability of w1 occurring given a spam comment

P(spam): probability of something being spam

Page 24: Comment Spam Identification

log(P(spam|comment)) = log(P(w1|spam) ∙ P(w2|spam) ∙ … ∙ P(wn|spam) ∙ P(spam)) + c

log(P(ham|comment)) = log(P(w1|ham) ∙ P(w2|ham) ∙ … ∙ P(wn|ham) ∙ P(ham)) + c

where c = –log(P(comment)) is the same in both lines.

Page 25: Comment Spam Identification

log(P(spam|comment)) = log(P(w1|spam)) + log(P(w2|spam)) + … + log(P(wn|spam)) + log(P(spam)) + c

log(P(ham|comment)) = log(P(w1|ham)) + log(P(w2|ham)) + … + log(P(wn|ham)) + log(P(ham)) + c
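
One practical reason for working with sums of logarithms is floating-point underflow: a product of many small probabilities rounds to zero, while the sum of their logs stays representable. The sketch below illustrates this. The function name, the dictionary layout, and the `floor` fallback for unseen words are assumptions, not something the slides specify.

```python
import math

def log_score(words, word_probs, prior, floor=1e-9):
    """Sum of log P(w|class) over the comment's words, plus log P(class).

    `word_probs` maps word -> P(w|class). Unseen or zero-probability words
    fall back to `floor` (a hypothetical constant) so log(0) is never taken.
    """
    total = math.log(prior)
    for w in words:
        total += math.log(max(word_probs.get(w, floor), floor))
    return total

# A 500-word comment whose words each have probability 0.001:
product = 1.0
for _ in range(500):
    product *= 1e-3          # underflows to exactly 0.0 long before the end

score = log_score(["w"] * 500, {"w": 1e-3}, prior=1.0)
print(product, score)        # the product is 0.0; the log score is about -3453.9
```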

Page 26: Comment Spam Identification

Fact:

P(spam|comment) = 1 – P(ham|comment)

Abuse of notation:

P(s) = P(spam|comment)
P(h) = P(ham|comment)

Page 27: Comment Spam Identification

P(s) = 1 – P(h)

m = log(P(s)) – log(P(h))
  = log(P(s)/P(h))

e^m = e^(log(P(s)/P(h)))
    = P(s)/P(h)

e^m ∙ P(h) = P(s)

Page 28: Comment Spam Identification

P(s) = 1 – P(h)

m = log(P(s)) – log(P(h))

e^m ∙ P(h) = P(s)

e^m ∙ P(h) = 1 – P(h)

(e^m + 1) ∙ P(h) = 1

P(h) = 1/(e^m + 1)
P(s) = 1 – P(h)

Page 29: Comment Spam Identification

m = log(P(s)) – log(P(h))

P(h) = 1/(e^m + 1)
P(s) = 1 – P(h)

Page 30: Comment Spam Identification

m = log(P(spam|comment)) – log(P(ham|comment))

P(ham|comment) = 1/(e^m + 1)
P(spam|comment) = 1 – P(ham|comment)
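
The conversion from the log difference m back to probabilities can be sketched as follows. The function name is an assumption. The branch on the sign of m is an implementation detail added here so that a very large m does not overflow the exponential; both branches compute 1/(e^m + 1).

```python
import math

def posteriors_from_log_scores(log_spam, log_ham):
    """Return (P(spam|comment), P(ham|comment)) from the two log scores."""
    m = log_spam - log_ham
    if m >= 0:
        # Same value as 1 / (e^m + 1), rewritten so exp() cannot overflow.
        p_ham = math.exp(-m) / (1.0 + math.exp(-m))
    else:
        p_ham = 1.0 / (math.exp(m) + 1.0)
    return 1.0 - p_ham, p_ham

# Scores in a 3:1 ratio give probabilities 0.75 and 0.25.
p_spam, p_ham = posteriors_from_log_scores(math.log(0.03), math.log(0.01))
```

Only the difference of the two scores matters, so any term that is added to both of them cancels.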

Page 31: Comment Spam Identification

In practice, just compare log(P(spam|comment)) with log(P(ham|comment)) and pick the larger.

Page 32: Comment Spam Identification

Implementation

Page 33: Comment Spam Identification

Corpus

• A collection of 50 blog pages with 1024 comments
• Manually tagged as spam/non-spam
• 67% are spam
• Provided by the Informatics Institute at the University of Amsterdam

Blocking Blog Spam with Language Model Disagreement, G. Mishne, D. Carmel, and R. Lempel. In: AIRWeb '05 - First International Workshop on Adversarial Information Retrieval on the Web, at the 14th International World Wide Web Conference (WWW2005), 2005.

Page 34: Comment Spam Identification

Most popular spam words

Word        Spam       Ham
casino      0.999918   0.000082076
betting     0.999879   0.000120513
texas       0.999813   0.000187148
biz         0.999776   0.000223708
holdem      0.999738   0.000262111
poker       0.999551   0.000448675
pills       0.999527   0.000473407
pokerabc    0.999506   0.000493821
teen        0.999455   0.000544715
online      0.999455   0.000544715
bowl        0.999437   0.000562555
gambling    0.999437   0.000562555
sonneries   0.999353   0.000647359
blackjack   0.999346   0.000653516
pharmacy    0.999254   0.000745723

Page 35: Comment Spam Identification

“Clean” words

Word        Spam          Ham
edu         0.00287339    0.997127
projects    0.00270528    0.997295
week        0.00270528    0.997295
etc         0.00270528    0.997295
went        0.00270528    0.997295
inbox       0.00270528    0.997295
bit         0.00270528    0.997295
someone     0.00255576    0.997444
bike        0.00230136    0.997699
already     0.00230136    0.997699
selling     0.00219225    0.997808
making      0.00209302    0.997907
squad       0.00184278    0.998157
left        0.00177216    0.998228
important   0.0013973     0.998603
pimps       0.000427782   0.999572

Page 36: Comment Spam Identification

Implementation

• Corpus parsing and processing
• Naïve Bayes algorithm
• Randomly select 70% for training, 30% for testing
• Stand-alone web service
• Written entirely in Python
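
The random 70%/30% split from the list above could look like the sketch below. The function name, the `seed` parameter, and the placeholder corpus of (text, is_spam) pairs are assumptions; the real corpus is the 1024 tagged comments described earlier.

```python
import random

def split_corpus(comments, train_fraction=0.7, seed=None):
    """Shuffle a copy of the labeled comments and split it into (train, test)."""
    shuffled = list(comments)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-in for the tagged corpus: (text, is_spam) pairs.
corpus = [("comment %d" % i, i % 3 != 0) for i in range(1024)]
train, test = split_corpus(corpus, seed=42)
print(len(train), len(test))  # 717 307
```

Fixing the seed makes a run repeatable, which helps when comparing configurations against the same test set.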

Page 37: Comment Spam Identification

It’s showtime!

Page 38: Comment Spam Identification

Configurations

• Separator used to tokenize comment
• Inclusion of words from header
• Classify based only on most significant words
• Double count non-spam comments
• Include article body as non-spam example
• Boosting

Page 39: Comment Spam Identification

Minimum Error Configuration

• Separator: [^a-z<>]+
• Header: Both
• Significant words: All
• Double count: No
• Include body: No
• Boosting: No
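
To see what the separator choice changes, the sketch below splits one comment with each of the three regular expressions compared in this deck. The `tokenize` helper, the lower-casing step, and the sample comment (which uses the reserved `.example` domain) are assumptions made for illustration.

```python
import re

def tokenize(comment, separator=r"[^a-z<>]+"):
    """Lower-case the comment and split it on the separator regex."""
    return [t for t in re.split(separator, comment.lower()) if t]

text = 'Play <a href="http://poker.example">Texas Hold_em</a> now!'

letters_only = tokenize(text, r"[^a-z]+")
with_brackets = tokenize(text, r"[^a-z<>]+")
word_chars = tokenize(text, r"\W+")

print(with_brackets)
# ['play', '<a', 'href', 'http', 'poker', 'example', '>texas', 'hold', 'em<', 'a>', 'now']
```

Keeping `<` and `>` inside tokens means that markup such as `<a` becomes a feature of its own, distinct from the plain word `a`.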

Page 40: Comment Spam Identification

Varying Configuration Parameters

[Figure: Header Inclusion vs. Test Set Error. x-axis: Header Inclusion Method (Both, Include, Tag); y-axis: Test Set Error]

[Figure: Word Separator vs. Test Set Error. x-axis: Separator RegEx ([^a-z]+, [^a-z<>]+, \W+); y-axis: Test Set Error]

Page 41: Comment Spam Identification

Varying Configuration Parameters

[Figure: Double Counting Non Spam. x-axis: Double Counting (False, True); y-axis: Test Set Error]

[Figure: Top Word Filtering Method vs. Test Set Error. x-axis: Top Word Filtering Method (ticks from 2 to 14); y-axis: Test Set Error]

Page 42: Comment Spam Identification

Boosting

• Naïve Bayes is applied repeatedly to the data.
• Produces a weighted majority model

bayesModels = empty list
weights = vector(1)
for i in 1 to M:
    model = naiveBayes(examples, weights)
    error = computeError(model, examples)
    weights = adjustWeights(examples, weights, error)
    bayesModels[i] = [model, error]
    if error == 0: break
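
A runnable Python sketch of this loop follows. The slides do not say how the weights are adjusted or how the models vote, so this version fills those gaps with the standard AdaBoost update, which is an assumption. The weighted naïve Bayes learner is also simplified: it scores only the words present in a comment and uses add-one smoothing. All function names are hypothetical.

```python
import math
from collections import defaultdict

def train_weighted_nb(examples, weights):
    """Weighted naive Bayes; examples are (list_of_words, is_spam) pairs."""
    class_weight = {True: 0.0, False: 0.0}
    word_weight = {True: defaultdict(float), False: defaultdict(float)}
    vocab = set()
    for (words, label), w in zip(examples, weights):
        class_weight[label] += w
        for word in set(words):
            word_weight[label][word] += w
            vocab.add(word)
    total = sum(class_weight.values())

    def predict(words):
        scores = {}
        for label in (True, False):
            # Add-one smoothing (an assumption) keeps every probability above 0.
            score = math.log((class_weight[label] + 1) / (total + 2))
            for word in set(words) & vocab:
                seen = word_weight[label].get(word, 0.0)
                score += math.log((seen + 1) / (class_weight[label] + 2))
            scores[label] = score
        return scores[True] > scores[False]   # True means "spam"

    return predict

def boost(examples, rounds=10):
    """AdaBoost-style loop; returns a list of (model, vote_weight) pairs."""
    n = len(examples)
    weights = [1.0] * n
    models = []
    for _ in range(rounds):
        predict = train_weighted_nb(examples, weights)
        wrong = [predict(words) != label for words, label in examples]
        error = sum(w for w, bad in zip(weights, wrong) if bad) / sum(weights)
        if error >= 0.5:      # no better than chance: stop without adding it
            break
        # Vote weight of this model; the floor keeps it finite when error is 0.
        alpha = 0.5 * math.log((1 - error) / max(error, 1e-10))
        models.append((predict, alpha))
        if error == 0:
            break
        # Raise the weight of misclassified examples, lower the rest, rescale.
        weights = [w * math.exp(alpha if bad else -alpha)
                   for w, bad in zip(weights, wrong)]
        scale = n / sum(weights)
        weights = [w * scale for w in weights]
    return models

def classify(models, words):
    """Weighted majority vote of all boosted models; True means spam."""
    vote = sum(alpha if predict(words) else -alpha for predict, alpha in models)
    return vote > 0
```

A call such as `classify(boost(examples, rounds=5), ["casino", "poker"])` returns True when the weighted vote favors spam. If no round beats chance, the model list is empty and every comment is classified as non-spam.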

Page 43: Comment Spam Identification

Boosting

[Figure: Boosting Level vs. Training Set Error. x-axis: Boosting Level (ticks from 5 to 25); y-axis label as extracted: Test Set Error]

[Figure: Boosting Level vs. Test Set Error. x-axis: Boosting Level (ticks from 5 to 25); y-axis: Test Set Error]

Page 44: Comment Spam Identification

Future work (or what we did not do)

Page 45: Comment Spam Identification

Data Processing

• Follow links in comment and include words in target web page

• More sophisticated tokenization and URL handling (handling $100,000...)

• Word stemming

Page 46: Comment Spam Identification

Features

• Ability to incorporate incoming comments into corpus

• Ability to mark comment as spam/non-spam

• Assign more weight to page content

• Adjust probability table based on page content, providing content-sensitive filtering

Page 47: Comment Spam Identification

Comments?

No spam, please.