Know your Neighbors: Web Spam Detection using the Web Topology. By Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock and Fabrizio Silvestri. Presented by Sovandy Hang, CS 4440, Fall 2007.

Page 1: Know your Neighbors: Web Spam Detection using the Web Topology By Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock and Fabrizio Silvestri

Know your Neighbors: Web Spam Detection using the Web Topology

By Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock and Fabrizio Silvestri

Presented by Sovandy Hang
CS 4440, Fall 2007

Page 2:

Outline
- About me
- Introduction
- Keywords
- How the process works
- Conclusion
- Questions and answers

Page 3:

About Me
- 5th-year CS and IE major
- Graduating next summer
- Interest: Enterprise Resource Planning
- Thinks all software should be open source

Page 4:

Introduction
- Web search is a part of our lives.
- Many businesses rely on the web.
- There is a huge economic incentive for commercial websites to influence search results.
- Web spamming is cheap and often successful.
- Web spam degrades the quality of search engine results.
- Web spam is annoying.

Page 5:

Keywords
- Web spam
- PageRank
- Spamdexing
- Spamicity
- Graph-based algorithm

Page 6:

Measurement Tool

Page 7:

How it works

[Diagram: Feature Extraction, then Classification, then Smoothing. Smoothing options: Clustering, Propagation, Stacked Graphical Learning.]

Page 8:

Feature Extraction
- The data set is obtained using a web crawler.
- For each page, its links and contents are obtained.
- From the data set, a full graph is built.
- For each host and page, certain features are computed.
- Link-based features are extracted from the host graph.
- Content-based features are extracted from individual pages.
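As a rough illustration of how page-level links could be collapsed into a host graph, here is a minimal Python sketch. The function name `build_host_graph`, the edge-weight convention (number of page-level links between two hosts), and the example URLs are assumptions made for illustration; the slides do not say how the authors built their graph.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_host_graph(page_links):
    """Collapse page-level links into a weighted host graph.

    page_links: iterable of (source_url, target_url) pairs from a crawl.
    Returns {source_host: {target_host: number_of_page_links}}.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst in page_links:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        # Skip malformed URLs and links that stay inside one host.
        if not src_host or not dst_host or src_host == dst_host:
            continue
        graph[src_host][dst_host] += 1
    return {host: dict(targets) for host, targets in graph.items()}


# Hypothetical crawl output: four page-level links.
links = [
    ("http://a.example/1", "http://b.example/x"),
    ("http://a.example/2", "http://b.example/y"),
    ("http://a.example/1", "http://a.example/2"),
    ("http://b.example/x", "http://c.example/"),
]
host_graph = build_host_graph(links)
```

Link-based features can then be computed on `host_graph`, while content-based features are still computed page by page.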

Page 9:

Link-based Features

Some important link-based features are:
- Degree-related measures
- PageRank
- TrustRank
- Truncated PageRank
- Estimation of supporters
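To make two of the features listed above concrete, the sketch below computes in-degree and PageRank by power iteration on a tiny graph. The damping factor of 0.85, the fixed iteration count, and the handling of dangling nodes (their rank is spread evenly over all nodes) are common textbook choices assumed here, not details taken from the slides.

```python
def in_degree(out_links):
    """Count incoming links per node; out_links maps node -> list of targets."""
    degree = {node: 0 for node in out_links}
    for targets in out_links.values():
        for target in targets:
            degree[target] = degree.get(target, 0) + 1
    return degree


def pagerank(out_links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict-of-lists graph."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = out_links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Toy graph: a and b both link to c, and c links back to a.
toy_graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
degrees = in_degree(toy_graph)
ranks = pagerank(toy_graph)
```

In this toy graph, `c` receives the most links and ends up with the highest rank, while `b`, which nobody links to, keeps only the teleportation share.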

Page 10:

Content-based Features

Some important content-based features are:
- Fraction of visible text
- Compression rate
- Corpus precision and corpus recall
- Query precision and query recall
- Independent trigram likelihood
- Entropy of trigrams
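Two of the content-based features above can be sketched in a few lines of Python. The definitions used here are assumptions: compression rate is taken as uncompressed size divided by zlib-compressed size, and trigram entropy as the Shannon entropy (in bits) of the distribution of word trigrams on the page. The example strings are made up.

```python
import math
import zlib
from collections import Counter


def compression_ratio(text):
    """Uncompressed size divided by compressed size (assumed definition)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


def trigram_entropy(text):
    """Shannon entropy, in bits, of the page's word-trigram distribution."""
    words = text.lower().split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    total = len(trigrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical pages: one keyword-stuffed, one ordinary sentence.
stuffed = "cheap pills buy now " * 50
natural = ("Web spam degrades the quality of search engines "
           "and wastes crawling and indexing resources")
```

Repetitive, keyword-stuffed text compresses far better and has lower trigram entropy than ordinary prose, which is why both measures are useful signals.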

Page 11:

Classification
- Create a base classifier from the link-based and content-based features.
- Apply a cost-sensitive decision tree to classify hosts as spam or non-spam.
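The idea of cost-sensitive classification can be shown with a one-level decision tree (a decision stump) on a single feature. This is only a sketch: the function name, the cost values, and the toy data are assumptions, and a real decision tree would split on many features recursively.

```python
def train_cost_sensitive_stump(values, labels,
                               spam_miss_cost=5.0, false_alarm_cost=1.0):
    """Pick the threshold on one feature that minimizes total cost.

    labels: 1 = spam, 0 = non-spam. A host is predicted spam when its
    feature value is >= threshold. Missing a spam host costs
    spam_miss_cost; flagging a normal host costs false_alarm_cost.
    """
    best_threshold, best_cost = None, float("inf")
    candidates = sorted(set(values)) + [float("inf")]
    for threshold in candidates:
        cost = 0.0
        for value, label in zip(values, labels):
            predicted = 1 if value >= threshold else 0
            if label == 1 and predicted == 0:
                cost += spam_miss_cost
            elif label == 0 and predicted == 1:
                cost += false_alarm_cost
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost


# Hypothetical feature values and labels for seven hosts.
values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
labels = [0, 0, 1, 0, 0, 1, 1]

equal_costs = train_cost_sensitive_stump(values, labels, 1.0, 1.0)
spam_heavy = train_cost_sensitive_stump(values, labels, 5.0, 1.0)
```

With equal costs the stump settles on a high threshold and misses one spam host. When missing spam is five times as costly, it lowers the threshold and accepts two false alarms instead.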

Page 12:

Smoothing
- Hosts are now labeled as spam or non-spam by the base classifier.
- Smoothing improves on the base classifier's predictions.
- A few smoothing techniques are:
  - Clustering
  - Propagation
  - Stacked graphical learning

Page 13:

Smoothing (Cont.)

Based on the topological dependencies of spam nodes:
- Links are not placed at random.
- Similar pages tend to link to each other more frequently than dissimilar pages.

Put another way:
- Spam tends to be clustered on the Web.
- Non-spam nodes tend to be linked by very few spam nodes, and usually link to no spam nodes.
- Spam nodes are mainly linked by spam nodes.

Page 14:

Page 15:

Smoothing - Clustering
- Split the graph into many clusters.
- Use the METIS graph clustering algorithm.
- If the majority of nodes in a cluster are spam, then all hosts in the cluster are labeled spam.
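The majority rule above can be sketched as follows. The cluster assignment is taken as given (for example, produced by a graph partitioner such as METIS); no partitioning is done here. The function name and the toy data are assumptions.

```python
from collections import defaultdict


def smooth_by_cluster(predictions, cluster_of):
    """Relabel whole clusters as spam when most of their hosts are spam.

    predictions: {host: 1 for spam, 0 for non-spam} from the base classifier.
    cluster_of:  {host: cluster_id}, assumed to come from a graph partitioner.
    """
    members = defaultdict(list)
    for host, cluster in cluster_of.items():
        members[cluster].append(host)

    smoothed = dict(predictions)
    for hosts in members.values():
        spam_votes = sum(predictions[host] for host in hosts)
        # Strict majority of the cluster is predicted spam.
        if spam_votes * 2 > len(hosts):
            for host in hosts:
                smoothed[host] = 1
    return smoothed


# Hypothetical base predictions and cluster assignment for six hosts.
predictions = {"a": 1, "b": 1, "c": 0, "d": 0, "e": 1, "f": 0}
cluster_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
smoothed = smooth_by_cluster(predictions, cluster_of)
```

Host `c` sits in a cluster where two of three hosts are predicted spam, so it is relabeled. The second cluster has no spam majority and is left unchanged.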

Page 16:

Smoothing - Propagation
- Propagate predictions using random walks.
- Start from a node labeled as spam by the base classifier, then follow links forward or backward.
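One way to read the propagation step is as a random walk with restart: the walker starts at hosts the base classifier labeled spam, follows links, and occasionally jumps back to a seed. The sketch below makes that reading concrete. The damping value, the iteration count, the dead-end handling, and the host names are assumptions, not the authors' exact procedure.

```python
def propagate(out_links, seeds, damping=0.85, iterations=50, backward=False):
    """Random walk with restart from seed (predicted-spam) nodes.

    out_links: {node: [targets]}. seeds: non-empty collection of nodes.
    backward=True walks against link direction (from target to source).
    Returns {node: visiting probability}, higher meaning closer to spam.
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    nodes.update(seeds)

    if backward:
        reversed_links = {node: [] for node in nodes}
        for source, targets in out_links.items():
            for target in targets:
                reversed_links[target].append(source)
        out_links = reversed_links

    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(iterations):
        new_score = {n: (1.0 - damping) * restart[n] for n in nodes}
        for node in nodes:
            targets = out_links.get(node, [])
            if targets:
                share = damping * score[node] / len(targets)
                for target in targets:
                    new_score[target] += share
            else:
                # Dead end: the walker jumps back to the seed set.
                for seed in seeds:
                    new_score[seed] += damping * score[node] / len(seeds)
        score = new_score
    return score


# Hypothetical host graph with one spam chain and one unrelated pair.
graph = {"spam1": ["spam2"], "spam2": ["target"], "good": ["other"]}
forward_scores = propagate(graph, {"spam1"})
backward_scores = propagate(graph, {"target"}, backward=True)
```

Hosts reachable from the seeds pick up a positive score, while hosts in the unrelated part of the graph stay at zero. A threshold on the score would then decide the final label.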

Page 17:

Smoothing - Stacked Graphical Learning
- It is a machine learning process.
- It creates extra features in addition to the content-based and link-based ones.
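A simple way to picture the extra feature is the average base-classifier prediction over a host's neighbors. The sketch below treats both in-links and out-links as neighbors and uses 0.0 for isolated hosts; both choices, the function name, and the toy values are assumptions for illustration.

```python
def neighbor_average_feature(out_links, spamicity):
    """For each host, average the predicted spamicity of its neighbors.

    out_links: {host: [target hosts]}.
    spamicity: {host: base-classifier spam score in [0, 1]}.
    Neighbors are hosts connected by a link in either direction.
    """
    neighbors = {host: set() for host in spamicity}
    for source, targets in out_links.items():
        for target in targets:
            if source in neighbors and target in neighbors and source != target:
                neighbors[source].add(target)
                neighbors[target].add(source)

    feature = {}
    for host, linked in neighbors.items():
        if linked:
            feature[host] = sum(spamicity[g] for g in linked) / len(linked)
        else:
            # Isolated host: no neighbors to average over.
            feature[host] = 0.0
    return feature


# Hypothetical graph and base-classifier scores.
out_links = {"a": ["b", "c"], "b": ["c"], "d": []}
spamicity = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.2}
extra = neighbor_average_feature(out_links, spamicity)
```

The resulting value would be appended to each host's feature vector and the classifier retrained. Here `c` looks innocent on its own (0.1) but gets a high neighbor average because it is linked with two likely-spam hosts.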

Page 18:

Conclusion
- The approach is based on the assumption that spammers tend to be linked together.
- Using both link-based and content-based features enhances detection quality.
- It can be used on web datasets of any size.
- The paper does not explain each step very well.

Page 19:

Useful Reading
- “Using Spam Farm to Boost PageRank” by Ye Du, Yaoyun Shi, Xin Zhao
- “Using Annotations in Enterprise Search” by Pavel A. Dmitriev, Nadav Eiron, Marcus Fontoura, Eugene Shekita

Page 20:

Questions?