Blog Track Open Task: Spam Blog Classification
P. Kolari, T. Finin, A. Java and J. Mayfield
University of Maryland, Baltimore County and Johns Hopkins University Applied Physics Laboratory

NTU Natural Language Processing Lab.
Advisor: Hsin-Hsi Chen
Speaker: Sheng-Chung Yen
Date: 2007/01/08




Outline
- Introduction
- Splog Detection Problem
- Detecting Splogs
- TREC Blog Track 2006
- Splog Task Assessment
- Conclusion




Introduction
Spam blogs, or splogs, are blogs created for the sole purpose of hosting ads, promoting the page rank of affiliates, and getting new content indexed.

This open task submission:
- details how splogs impact Opinion Identification
- proposes an approach to assessment and evaluation for a Spam Blog Classification task in 2007


Splog Detection Problem
A post from a splog:

1. Displays ads in high-paying contexts.
2. Features content plagiarized from other blogs.
3. Hosts hyperlinks that create link farms.

Splog detection is a classification problem within the blogosphere subset B:

BA: all authentic content
BS: content from splogs
BU: blog pages for which a judgment of authenticity or spam has not yet been made
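As a rough illustration of this three-way split, the sketch below partitions a set of blogs into authentic, splog, and unjudged parts. The `Judgment` enum, the `partition` helper, and the example hostnames are hypothetical names chosen for this example; they do not come from the paper.

```python
from enum import Enum


class Judgment(Enum):
    AUTHENTIC = "BA"  # judged authentic
    SPLOG = "BS"      # judged to be a splog
    UNJUDGED = "BU"   # no judgment made yet


def partition(blogs, judgments):
    """Split blogs into BA, BS, and BU; blogs without a judgment land in BU."""
    parts = {label: set() for label in Judgment}
    for blog in blogs:
        parts[judgments.get(blog, Judgment.UNJUDGED)].add(blog)
    return parts


# Hypothetical blogs and judgments, for illustration only.
blogs = ["a.example", "b.example", "c.example"]
judgments = {"a.example": Judgment.AUTHENTIC, "b.example": Judgment.SPLOG}
parts = partition(blogs, judgments)
print({label.value: sorted(members) for label, members in parts.items()})
```

A classifier's job, in these terms, is to move pages out of the unjudged part and into one of the other two.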

[Figure: a post from a splog, with regions labeled 1, 2, and 3]

[Figure: the blogosphere subset B and its parts BA, BS, and BU]


Detecting Splogs
All models are based on SVMs.

Words (bag-of-words)
– Ex: “I”, “We”, “my”, “what” indicate an authentic blog

Word N-Gram
– Ex: “comments-off”, “in-uncategorized” indicate a splog
– Ex: “2-comments”, “1-comments”, “I have”, “to my” indicate an authentic blog

Tokenized Anchors
– Anchor text: <a>anchor text</a>
– Ex: “comment”, “flickr” indicate an authentic blog
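The feature types above can be sketched in a few lines of standard-library Python, together with a toy linear SVM trained on them. This is a minimal sketch under stated assumptions, not the authors' implementation: the tokenizer, the hyphen used to join n-grams, the label convention (+1 splog, -1 authentic), the hyperparameters, and the two training sentences are all invented for illustration. The trainer uses plain stochastic subgradient descent on the hinge loss, standing in for whatever SVM package the authors used.

```python
import re
from collections import Counter
from html.parser import HTMLParser

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN.findall(text.lower())


def bag_of_words(text):
    """Count single words."""
    return Counter(tokenize(text))


def word_ngrams(text, n=2):
    """Count runs of n adjacent words, joined with '-' (an assumed convention)."""
    tokens = tokenize(text)
    return Counter("-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


class AnchorText(HTMLParser):
    """Collect the text that appears between <a> and </a>."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.texts.append(data)


def anchor_tokens(html):
    """Count the tokens found in anchor text."""
    parser = AnchorText()
    parser.feed(html)
    parser.close()
    return Counter(tok for text in parser.texts for tok in tokenize(text))


def score(w, b, x):
    """Linear decision value w.x + b for a sparse feature mapping x."""
    return sum(w.get(f, 0.0) * v for f, v in x.items()) + b


def train_linear_svm(samples, labels, epochs=50, lam=0.01, lr=0.1):
    """Linear SVM: stochastic subgradient descent on the L2-regularized hinge loss.

    samples are feature -> value mappings; labels are +1 (splog) or -1 (authentic).
    """
    w, b = {}, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            violated = y * score(w, b, x) < 1  # inside the margin or misclassified
            for f in w:  # the regularization term shrinks every weight
                w[f] *= 1 - lr * lam
            if violated:  # hinge-loss subgradient step
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + lr * y * v
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 (splog) or -1 (authentic)."""
    return 1 if score(w, b, x) >= 0 else -1


# Toy training set: two invented sentences, one per class.
splog_text = "Comments off. Posted in uncategorized."
blog_text = "I have 2 comments to my post."
w, b = train_linear_svm([word_ngrams(splog_text), word_ngrams(blog_text)], [1, -1])

print(predict(w, b, word_ngrams("comments off")))
print(predict(w, b, word_ngrams("I have")))
print(anchor_tokens('<p>See <a href="http://www.flickr.com/">my flickr</a></p>'))
```

Sparse dictionaries keep the sketch short; a real system would train on many labeled blogs and would likely use an established SVM library instead.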


Tokenized URLs
– Links pointing to a “.info” domain indicate a splog
– Links pointing to “flickr”, “technorati”, or “feedster” indicate an authentic blog

Global Models
– Authentic blogs are very unlikely to link to splogs.
– Splogs frequently link to other splogs.

Other Techniques
– Ping server
– URL/IP blacklists
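The URL and link-based cues above could be turned into features along the following lines. This is a minimal sketch, not the authors' implementation: `url_tokens`, `splog_link_fraction`, the example URLs, and the blog names are assumptions made for illustration, and the link fraction is only one simple way to express the observation that splogs tend to link to other splogs.

```python
import re
from urllib.parse import urlsplit


def url_tokens(url):
    """Split the host and path of a URL into lowercase alphanumeric tokens."""
    parts = urlsplit(url)
    return [t for t in re.split(r"[^a-z0-9]+", (parts.netloc + parts.path).lower()) if t]


def splog_link_fraction(blog, outlinks, known_splogs):
    """Fraction of a blog's outgoing links that point to already-known splogs."""
    targets = outlinks.get(blog, [])
    if not targets:
        return 0.0
    return sum(t in known_splogs for t in targets) / len(targets)


# Hypothetical URLs and link graph, for illustration only.
print(url_tokens("http://deals.example.info/buy-now"))
print(url_tokens("http://www.flickr.com/photos/"))

outlinks = {"s1": ["s2", "s3", "a1"], "a1": ["a2"]}
known_splogs = {"s2", "s3"}
print(splog_link_fraction("s1", outlinks, known_splogs))
print(splog_link_fraction("a1", outlinks, known_splogs))
```

The URL tokens can be counted and fed to a classifier like any other bag of tokens, while the link fraction is a single numeric feature per blog.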


TREC Blog Track 2006
17,969 feeds are from splogs, contributing 15.8% of the documents.

The number of splogs present varies from query to query, since splogs are query dependent.

[Figure: two example queries, “Cholesterol” and “Hybrid cars”]


Splog Task Assessment
The classification of splogs:
- Non-blog
- Keyword-stuffing
- Post-stitching
- Post-plagiarism
- Post-weaving
- Link-spam

[Figures: one slide for each category: Non-blog, Keyword-stuffing, Post-stitching, Post-plagiarism, Post-weaving, Link-spam]


Conclusion
This paper:
- proposes a spam blog classification task for TREC Blog Track 2007
- argues why it forms an important part of blog analytics
- surveys existing techniques for eliminating splogs
- shows how splogs impacted the primary task of TREC Blog Track 2006
- puts forward assessment and evaluation for such a task to be adopted in TREC Blog Track 2007