Comment Spam Identification

Eric Cheng & Eric Steinlauf

What is comment spam?

Total spam: 1,226,026,178
Total ham: 62,723,306

95% are spam!

Source: http://akismet.com/stats/ Retrieved 4/22/2007

Countermeasures

Blacklisting

5yx.org, 9kx.com, aakl.com, aaql.com, aazl.com, abcwaynet.com, abgv.com, abjg.com, ablazeglass.com, abseilextreme.net, actionbenevole.com, acvt.com, adbx.com, adhouseaz.com, advantechmicro.com, aeur.com, aeza.com, agentcom.com, ailh.org, akbu.com, alaskafibre.com, alkm.com, alqx.com, alumcasting-eng-inc.co!, americanasb.com, amwayau.com, amwaynz.com, amwaysa.com, amysudderjoy.com, anfb.com, anlusa.net, aobr.com, aoeb.com, apoctech.com, apqf.com, areagent.com, artstonehalloweencostumes.com

globalplasticscrap.com, gowest-veritas.com, greenlightgo.org, hadjimitsis.com, healthcarefx.com, herctrade.com, hobbyhighway.com, hominginc.com, hongkongdivas.com, hpspyacademy.com, hzlr.com, idlemindsonline.com, internetmarketingserve.com, jesh.org, jfcp.com, jfss.com, jittersjapan.com, jkjf.com, jkmrw.com, jknr.com, jksp.com, jkys.com, jtjk.com, justfareed.com, justyourbag.com, kimsanghee.org, kiosksusa.com, knivesnstuff.com, knoxvillevideo.com, ksj!, kwscuolashop.com, lancashiremcs.com, lnjk.com, localmediaaccess.com, lrgww.com, marketing-in-china.com

rockymountainair.org

rstechresources.com

samsung-integer.com

sandiegonhs.org

screwpile.org

scvend.org

sell-in-china.com

sensationalwraps.com

sevierdesign.com

starbikeshop.com

struthersinc.com

swarangeet.com

thecorporategroup.net

thehawleyco.com

thehumancrystal.com

thinkaids.org

thisandthatgiftshop.net

thomsungroup.com

ti0.org

timeby.net

tradewindswf.com

tradingb2c.com

turkeycogroup.net

vassagospalace.com

vyoung.net

web-toggery.com

webedgewars.com

webshoponsalead.com

webtoggery.com

willman-paris.com

worldwidegoans.com
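
A blacklist check could be sketched roughly as follows. This is an illustration only: the function name `contains_blacklisted_link`, the crude URL pattern, and the three-entry sample set are assumptions made for the example, not part of the system described in these slides.

```python
import re
from urllib.parse import urlparse

# A tiny sample taken from the list above; a real blacklist is far longer.
BLACKLIST = {"5yx.org", "sell-in-china.com", "webtoggery.com"}

# Hypothetical helper pattern: a crude match for http(s) links in comment text.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)

def contains_blacklisted_link(comment: str) -> bool:
    """Return True if any link in the comment points at a blacklisted domain."""
    for url in URL_RE.findall(comment):
        host = (urlparse(url).hostname or "").lower()
        # Match the listed domain itself and any subdomain of it.
        if any(host == d or host.endswith("." + d) for d in BLACKLIST):
            return True
    return False
```

A check like this only catches domains that are already on the list.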

Captchas

• "Completely Automated Public Turing test to tell Computers and Humans Apart"

Other ad-hoc/weak methods

• Authentication / registration
• Comment throttling
• Disallowing links in comments
• Moderation

Our Approach – Naïve Bayes

• Statistical
• Adaptive
• Automatic
• Scalable and extensible
• Works well for spam e-mail

Naïve Bayes

P(A|B) ∙ P(B) = P(B|A) ∙ P(A) = P(AB)

P(A|B) = P(B|A) ∙ P(A) / P(B)

P(spam|comment) = P(comment|spam) ∙ P(spam) / P(comment)

P(spam|comment) = P(w1|spam) ∙ P(w2|spam) ∙ … ∙ P(wn|spam) ∙ P(spam) / P(comment)

(naïve assumption: the words w1 … wn are treated as independent given the class)

Probability of w1 occurring given a spam comment

P(w1|spam) = 1 – (1 – x/y)^n

where x is the number of times w1 appears in all spam messages, y is the total number of words in all spam messages, and n is the length of the given comment.

Corpus: "Texas casino", "Online Texas hold’em"

Incoming comment: "Texas gambling site"

P(Texas|spam) = 1 – (1 – 2/5)^3 = 0.784
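
The worked example can be checked with a few lines of Python. This is a minimal sketch: the function name `word_given_class` and the lower-cased, whitespace-split word lists are assumptions made for illustration.

```python
def word_given_class(word, corpus_words, comment_length):
    """P(word|class) = 1 - (1 - x/y)^n, using the definitions of x, y, n above."""
    x = corpus_words.count(word)   # occurrences of the word in the class corpus
    y = len(corpus_words)          # total number of words in the class corpus
    n = comment_length             # length of the incoming comment
    return 1 - (1 - x / y) ** n

spam_corpus = "texas casino online texas hold'em".split()   # 5 words, "texas" twice
comment = "texas gambling site".split()                     # 3 words

p = word_given_class("texas", spam_corpus, len(comment))
print(round(p, 3))  # 0.784
```

Note that a word that never appears in the corpus gets probability 0 under this formula, which matters once logarithms are taken.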





Probability of something being spam

P(spam|comment) = P(w1|spam) ∙ P(w2|spam) ∙ … ∙ P(wn|spam) ∙ P(spam) / P(comment)

P(ham|comment) = P(w1|ham) ∙ P(w2|ham) ∙ … ∙ P(wn|ham) ∙ P(ham) / P(comment)

P(comment) is unknown (??????), but it is the same in both expressions, so only the numerators need to be compared:

P(spam|comment) ∝ P(w1|spam) ∙ P(w2|spam) ∙ … ∙ P(wn|spam) ∙ P(spam)

P(ham|comment) ∝ P(w1|ham) ∙ P(w2|ham) ∙ … ∙ P(wn|ham) ∙ P(ham)

Taking logs turns the products into sums, with the same constant c = –log(P(comment)) in both:

log(P(spam|comment)) = log(P(w1|spam)) + log(P(w2|spam)) + … + log(P(wn|spam)) + log(P(spam)) + c

log(P(ham|comment)) = log(P(w1|ham)) + log(P(w2|ham)) + … + log(P(wn|ham)) + log(P(ham)) + c
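
The sum-of-logs comparison might look like this in Python. It is a sketch under stated assumptions: the toy corpora, the equal priors, and the `floor` value that keeps unseen words from producing log(0) are all illustrative choices, not details taken from the slides.

```python
import math

def word_given_class(word, corpus_words, n, floor=1e-6):
    """1 - (1 - x/y)^n, floored (an assumption) so unseen words avoid log(0)."""
    x, y = corpus_words.count(word), len(corpus_words)
    return max(1 - (1 - x / y) ** n, floor)

def log_score(comment_words, corpus_words, prior):
    """log(P(w1|c)) + ... + log(P(wn|c)) + log(P(c)); the shared P(comment) term is omitted."""
    n = len(comment_words)
    word_logs = sum(math.log(word_given_class(w, corpus_words, n)) for w in comment_words)
    return word_logs + math.log(prior)

# Toy corpora, invented for this example.
spam_words = "texas casino online texas holdem poker".split()
ham_words = "nice post about my bike trip last week".split()
comment = "online poker texas".split()

log_spam = log_score(comment, spam_words, prior=0.5)
log_ham = log_score(comment, ham_words, prior=0.5)
label = "spam" if log_spam > log_ham else "ham"
```

Working in log space avoids the floating-point underflow that a long product of small probabilities would cause.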

Fact:

P(spam|comment) = 1 – P(ham|comment)

Abuse of notation:

P(s) = P(spam|comment)
P(h) = P(ham|comment)

P(s) = 1 – P(h)

m = log(P(s)) – log(P(h))
  = log(P(s)/P(h))

e^m = e^log(P(s)/P(h))
    = P(s)/P(h)

e^m ∙ P(h) = P(s)
e^m ∙ P(h) = 1 – P(h)
(e^m + 1) ∙ P(h) = 1

P(h) = 1/(e^m + 1)
P(s) = 1 – P(h)

In full notation:

m = log(P(spam|comment)) – log(P(ham|comment))
P(ham|comment) = 1/(e^m + 1)
P(spam|comment) = 1 – P(ham|comment)

m can be computed from the two sums of logs alone, because the shared term –log(P(comment)) cancels in the difference.

In practice, just compare log(P(ham|comment)) and log(P(spam|comment)).
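
The last step of the derivation, recovering the two probabilities from m, can be sketched as below. The function name `posterior_from_logs` is hypothetical, and the overflow guard is an added assumption for very large m.

```python
import math

def posterior_from_logs(log_spam, log_ham):
    """Return (P(spam|comment), P(ham|comment)) from the two log scores."""
    m = log_spam - log_ham
    try:
        p_ham = 1.0 / (math.exp(m) + 1.0)   # P(h) = 1/(e^m + 1)
    except OverflowError:
        p_ham = 0.0                          # e^m too large for a float: P(h) is effectively 0
    return 1.0 - p_ham, p_ham

# Unnormalized scores 0.03 (spam) and 0.01 (ham) give odds of 3 to 1.
p_spam, p_ham = posterior_from_logs(math.log(0.03), math.log(0.01))
print(round(p_spam, 2))  # 0.75
```

The inputs do not need to be normalized, since only their difference is used.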

Implementation

Corpus

• A collection of 50 blog pages with 1024 comments

• Manually tagged as spam/non-spam
• 67% are spam
• Provided by the Informatics Institute at University of Amsterdam

Blocking Blog Spam with Language Model Disagreement, G. Mishne, D. Carmel, and R. Lempel. In: AIRWeb '05 - First International Workshop on Adversarial Information Retrieval on the Web, at the 14th International World Wide Web Conference (WWW2005), 2005.

Most popular spam words

Word        P(spam)     P(ham)
casino      0.999918    0.000082076
betting     0.999879    0.000120513
texas       0.999813    0.000187148
biz         0.999776    0.000223708
holdem      0.999738    0.000262111
poker       0.999551    0.000448675
pills       0.999527    0.000473407
pokerabc    0.999506    0.000493821
teen        0.999455    0.000544715
online      0.999455    0.000544715
bowl        0.999437    0.000562555
gambling    0.999437    0.000562555
sonneries   0.999353    0.000647359
blackjack   0.999346    0.000653516
pharmacy    0.999254    0.000745723

“Clean” words

Word        P(spam)       P(ham)
edu         0.00287339    0.997127
projects    0.00270528    0.997295
week        0.00270528    0.997295
etc         0.00270528    0.997295
went        0.00270528    0.997295
inbox       0.00270528    0.997295
bit         0.00270528    0.997295
someone     0.00255576    0.997444
bike        0.00230136    0.997699
already     0.00230136    0.997699
selling     0.00219225    0.997808
making      0.00209302    0.997907
squad       0.00184278    0.998157
left        0.00177216    0.998228
important   0.0013973     0.998603
pimps       0.000427782   0.999572

Implementation

• Corpus parsing and processing
• Naïve Bayes algorithm
• Randomly select 70% for training, 30% for testing
• Stand-alone web service
• Written entirely in Python
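
The random 70/30 split can be sketched as follows. The helper name `split_corpus` and the optional `seed` parameter are assumptions for illustration; the slides do not say how the split was implemented.

```python
import random

def split_corpus(comments, train_fraction=0.7, seed=None):
    """Shuffle a copy of the comments and split it into (training, testing) lists."""
    shuffled = list(comments)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# With 1024 comments this gives 716 for training and 308 for testing.
train, test = split_corpus(range(1024), seed=0)
```

Fixing the seed makes a run repeatable, which helps when comparing configurations against the same test set.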

It’s showtime!

Configurations

• Separator used to tokenize comment
• Inclusion of words from header
• Classify based only on most significant words
• Double count non-spam comments
• Include article body as non-spam example
• Boosting

Minimum Error Configuration

• Separator: [^a-z<>]+
• Header: Both
• Significant words: All
• Double count: No
• Include body: No
• Boosting: No
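
To see what the separator setting changes, here is a small sketch that tokenizes one comment with three candidate separators, reading them as the regular expressions `[^a-z]+`, `[^a-z<>]+`, and `\W+`. The `tokenize` helper and the lower-casing step are assumptions; the slides do not show the actual tokenizer.

```python
import re

def tokenize(text, separator):
    """Lower-case the text (assumed), split on the separator regex, drop empty tokens."""
    return [tok for tok in re.split(separator, text.lower()) if tok]

sample = "Buy <b>Poker</b> chips 4U"

print(tokenize(sample, r"[^a-z]+"))    # ['buy', 'b', 'poker', 'b', 'chips', 'u']
print(tokenize(sample, r"[^a-z<>]+"))  # ['buy', '<b>poker<', 'b>', 'chips', 'u']
print(tokenize(sample, r"\W+"))        # ['buy', 'b', 'poker', 'b', 'chips', '4u']
```

With `[^a-z<>]+`, angle brackets stay attached to neighbouring letters, so fragments of HTML markup become tokens of their own.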

Varying Configuration Parameters

[Figure: Header Inclusion vs. Test Set Error. x-axis: Header Inclusion Method (Both, Include, Tag); y-axis: Test Set Error, ticks from 0.060 to 0.16]

[Figure: Word Separator vs. Test Set Error. x-axis: Separator RegEx ([^a-z]+, [^a-z<>]+, \W+); y-axis: Test Set Error, ticks from 0.0590 to 0.0620]

Varying Configuration Parameters

[Figure: Double Counting Non Spam. x-axis: Double Counting (False, True); y-axis: Test Set Error, ticks from 0.0590 to 0.0620]

[Figure: Top Word Filtering Method vs. Test Set Error. x-axis: Top Word Filtering Method (2 to 14); y-axis: Test Set Error, ticks from 0.100 to 0.25]

Boosting

• Naïve Bayes is applied repeatedly to the data.
• Produces a weighted majority model.

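# naiveBayes, computeError and adjustWeights are helpers not shown on this slide.
bayesModels = []
weights = [1.0] * len(examples)   # vector(1): every example starts with weight 1
for i in range(M):
    # Train on the weighted examples, then measure the model's error on them.
    model = naiveBayes(examples, weights)
    error = computeError(model, examples)
    # Reweight the examples based on this round's error.
    weights = adjustWeights(examples, weights, error)
    bayesModels.append([model, error])
    if error == 0:
        break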
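
One way to fill in the helpers is an AdaBoost-style scheme, sketched below. This is an assumption: the slides do not say which reweighting or voting rule was used. The names `vote_weight`, `adjust_weights`, `boost`, and `weighted_majority` are hypothetical, and `train` stands for any learner that accepts example weights (for instance a naïve Bayes that counts each example in proportion to its weight).

```python
import math

def vote_weight(error):
    """AdaBoost-style weight for a model with weighted error 0 < error < 0.5."""
    return 0.5 * math.log((1 - error) / error)

def adjust_weights(weights, wrong, error):
    """Raise the weight of misclassified examples, lower the rest, renormalize."""
    alpha = vote_weight(error)
    scaled = [w * math.exp(alpha if bad else -alpha) for w, bad in zip(weights, wrong)]
    total = sum(scaled)
    return [w / total for w in scaled]

def boost(examples, labels, train, rounds):
    """train(examples, labels, weights) returns a callable: example -> bool (True = spam)."""
    weights = [1.0 / len(examples)] * len(examples)
    models = []  # (model, weighted training error) pairs
    for _ in range(rounds):
        model = train(examples, labels, weights)
        wrong = [model(ex) != y for ex, y in zip(examples, labels)]
        error = sum(w for w, bad in zip(weights, wrong) if bad)
        if error == 0:
            return [(model, 0.0)]   # a perfect model decides alone
        if error >= 0.5:
            break                   # no better than chance: stop adding models
        models.append((model, error))
        weights = adjust_weights(weights, wrong, error)
    return models

def weighted_majority(models, example):
    """Combine the models' spam/ham votes, each weighted by vote_weight(error)."""
    if len(models) == 1:
        return models[0][0](example)
    vote = sum((1 if m(example) else -1) * vote_weight(e) for m, e in models)
    return vote > 0
```

Each round concentrates weight on the comments the previous model got wrong, so later models specialize on the hard cases.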

Boosting

[Figure: Boosting Level vs. Training Set Error. x-axis: Boosting Level (5 to 25); y-axis: error, ticks from 0.016 to 0.024]

[Figure: Boosting Level vs. Test Set Error. x-axis: Boosting Level (5 to 25); y-axis: Test Set Error, ticks from 0.105 to 0.140]

Future work (or what we did not do)

Data Processing

• Follow links in comment and include words in target web page

• More sophisticated tokenization and URL handling (handling $100,000...)

• Word stemming

Features

• Ability to incorporate incoming comments into corpus

• Ability to mark comment as spam/non-spam

• Assign more weight on page content
• Adjust probability table based on page content, providing content-sensitive filtering

Comments?

No spam, please.
