Know your Neighbors: Web Spam Detection using the Web Topology
By Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock and Fabrizio Silvestri
Presented by Sovandy Hang
CS 4440, Fall 2007
Outline
- About me
- Introduction
- Keywords
- How does the process work?
- Conclusion
- Questions and answers
About Me
- 5th-year CS and IE major
- Graduating next summer
- Interest: Enterprise Resource Planning
- Thinks all software should be open source
Introduction
- Web search is a part of our lives.
- Many businesses rely on the web.
- There is a huge economic incentive for commercial websites to influence search results.
- Web spamming is cheap and often successful.
- Web spam degrades the quality of search engines.
- Web spam is annoying.
Keywords
- Web spam
- PageRank
- Spamdexing
- Spamicity
- Graph-based algorithm
- Measurement tool
How does it work?
Feature Extraction -> Classification -> Smoothing (Clustering, Propagation, Stacked Graphical Learning)
Feature Extraction
- The data set is obtained using a web crawler.
- For each page, its links and contents are obtained.
- From the data set, a full graph is built.
- For each host and page, certain features are computed.
- Link-based features are extracted from the host graph.
- Content-based features are extracted from individual pages.
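
The step from crawled pages to a host graph can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the function names, the `(source_url, target_url)` input format, and the choice to drop links within the same host are assumptions made for the example.

```python
from collections import defaultdict
from urllib.parse import urlparse


def build_host_graph(page_links):
    """Collapse page-level links into a weighted host-level graph.

    page_links: iterable of (source_url, target_url) pairs (assumed format).
    Returns {source_host: {target_host: number_of_page_links}}.
    """
    host_graph = defaultdict(lambda: defaultdict(int))
    for src, dst in page_links:
        src_host = urlparse(src).netloc
        dst_host = urlparse(dst).netloc
        # Assumption: links inside one host carry no host-level signal.
        if src_host and dst_host and src_host != dst_host:
            host_graph[src_host][dst_host] += 1
    return {host: dict(targets) for host, targets in host_graph.items()}


def degree_features(host_graph):
    """Compute in-degree and out-degree for every host in the graph."""
    out_degree = {host: len(targets) for host, targets in host_graph.items()}
    in_degree = defaultdict(int)
    for targets in host_graph.values():
        for target in targets:
            in_degree[target] += 1
    hosts = set(out_degree) | set(in_degree)
    return {
        host: {
            "in_degree": in_degree.get(host, 0),
            "out_degree": out_degree.get(host, 0),
        }
        for host in hosts
    }
```

Calling `degree_features(build_host_graph(links))` on a list of crawled link pairs gives one small feature record per host, which is the kind of input the later classification step would consume.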
Link-based Features
Some important link-based features are:
- Degree-related measures
- PageRank
- TrustRank
- Truncated PageRank
- Estimation of supporters
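
As an illustration of one of these features, here is a minimal sketch of plain PageRank by power iteration. The damping factor of 0.85, the fixed iteration count, and the uniform handling of nodes without out-links are conventional choices assumed for the example, not values taken from the slides.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    graph: {node: [nodes it links to]} (assumed input format).
    Returns {node: score}; the scores sum to 1.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by nodes without out-links is spread over all nodes.
        dangling = sum(rank[u] for u in nodes if not graph.get(u))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for u, targets in graph.items():
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
        rank = new_rank
    return rank
```

TrustRank follows the same iteration but restarts the walk only at a set of trusted seed nodes instead of uniformly at every node.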
Content-based Features
Some important content-based features are:
- Fraction of visible text
- Compression rate
- Corpus precision and corpus recall
- Query precision and query recall
- Independent trigram likelihood
- Entropy of trigrams
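
Two of these features are easy to sketch on the visible text of a page. The definitions below are assumptions made for illustration: compression rate is taken as compressed size divided by raw size (the paper may define the ratio the other way round), and trigrams are taken over whitespace-separated words.

```python
import math
import zlib
from collections import Counter


def compression_rate(text):
    """Compressed size divided by raw size (assumed definition).

    Highly repetitive text, typical of keyword stuffing, gives a low value.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)


def trigram_entropy(text):
    """Shannon entropy, in bits, of the word-trigram distribution."""
    words = text.split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    total = len(trigrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Both functions return a single number per page; page-level values would then need to be aggregated per host before they can be combined with the link-based features.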
Classification
- Create a base classifier from the link-based and content-based features.
- Apply a cost-sensitive decision tree to classify hosts as spam or non-spam.
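
The idea of cost sensitivity can be shown with the smallest possible tree, a one-split decision stump on a single feature. This is a simplified stand-in, not the classifier used in the paper; the function name and the cost values `fn_cost` and `fp_cost` are hypothetical.

```python
def fit_cost_sensitive_stump(values, labels, fn_cost=5.0, fp_cost=1.0):
    """Choose the threshold t for the rule 'spam if value >= t'.

    values: one feature value per host.
    labels: True for spam, False for non-spam.
    fn_cost: cost of labeling a spam host as non-spam (hypothetical value).
    fp_cost: cost of labeling a non-spam host as spam (hypothetical value).
    Returns (best_threshold, total_cost_on_training_data).
    """
    # Every observed value is a candidate split; infinity means "never spam".
    candidates = sorted(set(values)) + [float("inf")]
    best_threshold, best_cost = None, float("inf")
    for t in candidates:
        cost = 0.0
        for value, is_spam in zip(values, labels):
            predicted_spam = value >= t
            if is_spam and not predicted_spam:
                cost += fn_cost
            elif predicted_spam and not is_spam:
                cost += fp_cost
        if cost < best_cost:
            best_threshold, best_cost = t, cost
    return best_threshold, best_cost
```

Raising `fn_cost` relative to `fp_cost` pushes the chosen threshold down, so fewer spam hosts are missed at the price of more false alarms. A full decision tree applies the same trade-off at every split.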
Smoothing
- Hosts are now labeled as spam or non-spam by the base classifier.
- Smoothing improves on the base classifier's predictions.
- A few smoothing techniques are:
  - Clustering
  - Propagation
  - Stacked graphical learning
Smoothing (Cont.)
- Based on topological dependencies of spam nodes:
  - Links are not placed at random.
  - Similar pages tend to link to each other more frequently than dissimilar pages.
- Or:
  - Spam tends to be clustered on the Web.
  - Non-spam nodes tend to be linked by very few spam nodes, and usually link to no spam nodes.
  - Spam nodes are mainly linked by spam nodes.
Smoothing - Clustering
- Split the graph into many clusters.
- Use the METIS graph clustering algorithm.
- If the majority of nodes in a cluster are spam, then all hosts in the cluster are labeled spam.
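
The majority rule above can be sketched as follows, assuming the cluster assignment has already been computed elsewhere (METIS itself is not called here). The function name, the input dictionaries, and the 0.5 threshold are assumptions made for the example.

```python
from collections import defaultdict


def smooth_by_cluster(labels, cluster_of, threshold=0.5):
    """Relabel whole clusters as spam when most of their hosts are spam.

    labels: {host: True if the base classifier predicted spam}.
    cluster_of: {host: cluster id}, assumed to come from a graph partitioner.
    threshold: spam fraction above which the whole cluster is marked spam.
    """
    members = defaultdict(list)
    for host, cluster in cluster_of.items():
        members[cluster].append(host)

    smoothed = dict(labels)
    for hosts in members.values():
        spam_fraction = sum(labels[h] for h in hosts) / len(hosts)
        if spam_fraction > threshold:
            for h in hosts:
                smoothed[h] = True
    return smoothed
```

Hosts in clusters that stay at or below the threshold keep the label the base classifier gave them.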
Smoothing - Propagation
- Propagate predictions using random walks.
- Start from a node labeled as spam by the base classifier, then follow links forward or backward.
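
One way to read this is as a random walk with restart, where the walker always restarts at a host the base classifier labeled spam. The sketch below computes the resulting visit probabilities by iteration; the function name, the damping value, and the handling of dead ends are assumptions, not details from the slides.

```python
def propagate_spam_scores(graph, spam_seeds, damping=0.85, iterations=50,
                          backward=False):
    """Random walk with restart at the hosts labeled spam.

    graph: {node: [nodes it links to]} (assumed input format).
    spam_seeds: hosts labeled spam by the base classifier.
    backward: if True, walk against the link direction.
    Returns {node: score}; a higher score means closer to the spam seeds.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # Build the edge lists in the direction the walker will follow.
    edges = {node: [] for node in nodes}
    for u, targets in graph.items():
        for v in targets:
            if backward:
                edges[v].append(u)
            else:
                edges[u].append(v)

    seeds = {s for s in spam_seeds if s in nodes}
    if not seeds:
        return {node: 0.0 for node in nodes}
    restart = {node: (1.0 / len(seeds) if node in seeds else 0.0)
               for node in nodes}

    score = dict(restart)
    for _ in range(iterations):
        new_score = {node: (1.0 - damping) * restart[node] for node in nodes}
        for u in nodes:
            if edges[u]:
                share = damping * score[u] / len(edges[u])
                for v in edges[u]:
                    new_score[v] += share
            else:
                # Assumption: at a dead end the walker jumps back to a seed.
                for s in seeds:
                    new_score[s] += damping * score[u] / len(seeds)
        score = new_score
    return score
```

Hosts that cannot be reached from any seed in the chosen direction keep a score of zero; a threshold on the score would turn it back into a spam or non-spam label.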
Smoothing - Stacked Graphical Learning
- It is a machine learning process.
- It creates extra features in addition to the content-based and link-based ones.
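
A minimal sketch of how such an extra feature might be built: for each host, average the base classifier's predicted spam scores over its linked hosts and append that average to the host's feature vector, then retrain on the extended vectors. The function name, the input format, and the fallback for hosts without neighbors are assumptions made for the example.

```python
def add_neighbor_feature(features, predictions, neighbors):
    """Append the average predicted spam score of linked hosts as a feature.

    features: {host: [feature values]}.
    predictions: {host: spam score in [0, 1] from the base classifier}.
    neighbors: {host: [hosts linked to or from it]} (assumed format).
    Returns a new {host: [feature values..., neighbor average]}.
    """
    stacked = {}
    for host, row in features.items():
        related = [g for g in neighbors.get(host, []) if g in predictions]
        if related:
            average = sum(predictions[g] for g in related) / len(related)
        else:
            # Assumption: an isolated host falls back to its own score.
            average = predictions.get(host, 0.0)
        stacked[host] = row + [average]
    return stacked
```

The classifier is then trained again on the extended vectors, and the procedure can be repeated using the new predictions.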
Conclusion
- The approach is based on the assumption that spammers tend to be linked together.
- Using both link-based and content-based features enhances detection quality.
- It can be used on web datasets of any size.
- The paper does not explain each step very well.
Useful Reading
- "Using Spam Farm to Boost PageRank" by Ye Du, Yaoyun Shi, Xin Zhao
- "Using Annotations in Enterprise Search" by Pavel A. Dmitriev, Nadav Eiron, Marcus Fontoura, Eugene Shekita
Questions?