Know your Neighbors: Web Spam Detection using the Web Topology
Carlos Castillo, chato@yahoo-inc.com
Debora Donato, debora@yahoo-inc.com
Aristides Gionis, gionis@yahoo-inc.com
Vanessa Murdock, vmurdock@yahoo-inc.com
Fabrizio Silvestri, f.silvestri@isti.cnr.it
Presented by Anton Rodriguez-Dmitriev
Personal Background
- Graduated from FSU
- Working on an MSECE, specializing in Controls
- CS minor
- Works part-time at STW Technic, LP
Web Spam Consequences
- Damages the reputation of the search engine
- Weakens the trust of its users
- Eiron et al. ranked 100 million pages using PageRank: 11 of the top 20 were pornographic pages
- PageRank alone cannot filter spam
- Costs are incurred in crawling, indexing, and storing spam pages
Some Popular Spamming Techniques
- Link spam: creating a link structure, usually a tightly knit community of links, to try to affect the outcome of a link-based ranking algorithm
- Content spam: maliciously crafting the content of a Web page using techniques such as keyword stuffing, i.e., inserting keywords that are related to popular queries
- Cloaking: sending different content to a search engine than to the regular visitors of a website
Topology of the Dataset
- Used the WEBSPAM-UK2006 dataset: a publicly available spam collection
- Undirected graph, pruned to contain only hosts that share more than 100 links
- Black nodes are spam; white nodes are non-spam
- Most spammers in the largest connected component are clustered together
- Other connected components are single-class
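The pruning step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the input format (triples of host pairs with a shared-link count) and all names are assumptions.

```python
# Hypothetical sketch of the host-graph pruning: keep an undirected edge only
# when the two hosts share more than `min_links` hyperlinks between them.
from collections import defaultdict

def build_pruned_graph(link_counts, min_links=100):
    """link_counts: iterable of (host_a, host_b, shared_link_count)."""
    graph = defaultdict(set)
    for a, b, count in link_counts:
        if count > min_links and a != b:
            graph[a].add(b)  # undirected: store the edge in both directions
            graph[b].add(a)
    return graph

edges = [("spam1.example", "spam2.example", 250),
         ("spam1.example", "news.example", 3)]
g = build_pruned_graph(edges)
# Only the heavily cross-linked pair survives the pruning.
```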
Evaluation of the Process
Confusion matrix:

                    Predicted non-spam   Predicted spam
  Actual non-spam           a                  b
  Actual spam               c                  d

- a: the number of non-spam examples that were correctly classified
- b: the number of non-spam examples that were falsely classified as spam
- c: the number of spam examples that were falsely classified as non-spam
- d: the number of spam examples that were correctly classified
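The standard metrics follow directly from the four counts a, b, c, d defined above (treating spam as the positive class). A minimal sketch; the example numbers are illustrative:

```python
# Evaluation metrics derived from the confusion matrix counts a, b, c, d.
def metrics(a, b, c, d):
    recall = d / (c + d)               # true-positive rate: spam caught
    false_positive_rate = b / (a + b)  # non-spam wrongly flagged as spam
    precision = d / (b + d)            # flagged hosts that really are spam
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, false_positive_rate, precision, f_measure

# Toy counts for illustration: 1000 non-spam hosts, 200 spam hosts.
r, fpr, p, f = metrics(a=900, b=100, c=50, d=150)
```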
Link-based Features
Degree-related measures:
- In-degree and out-degree of the hosts and their neighbors
- Edge reciprocity: the number of links that are reciprocal
- Assortativity: the ratio between the degree of a particular page and the average degree of its neighbors
- PageRank
- TrustRank: uses a subset of hand-picked trusted nodes and propagates their labels through the Web graph
- Truncated PageRank: a variant of PageRank that diminishes the influence of a page on the PageRank of its neighbors
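For reference, the base algorithm that TrustRank and Truncated PageRank modify can be sketched as a plain power iteration. This is the generic textbook PageRank, not the paper's implementation; the graph and parameter values are illustrative.

```python
# Power-iteration PageRank on an adjacency-list graph (out_links maps each
# node to the list of nodes it links to).
def pagerank(out_links, damping=0.85, iters=50):
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Teleport term: every node gets the undamped share.
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, targets in out_links.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new[t] += share
            else:  # dangling node: spread its rank uniformly
                for t in nodes:
                    new[t] += damping * rank[v] / n
        rank = new
    return rank

g = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
pr = pagerank(g)
```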
Link-based Features
Estimation of supporters:
- Given two nodes x and y, x is a d-supporter of y if the shortest path from x to y has length d
- N_d(x) is the set of d-supporters of page x
- Spam pages have a smaller bottleneck than non-spam pages
- Bottleneck number: b_d(x) = min over 1 <= j <= d of { |N_j(x)| / |N_{j-1}(x)| }
- Histogram of b_4(x) for spam and non-spam pages
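The supporter counts |N_j(x)| and the bottleneck number above can be computed with a depth-limited BFS. A sketch under simplifying assumptions: the graph is given as an undirected adjacency list (the paper works on a directed Web graph and uses probabilistic estimation at scale), and the example graph is a toy.

```python
# BFS-based supporter counts and bottleneck number b_d(x).
from collections import deque

def supporter_counts(graph, x, d):
    """Return [|N_0(x)|, ..., |N_d(x)|], where N_j(x) is the set of nodes
    at shortest-path distance exactly j from x (N_0(x) = {x})."""
    dist = {x: 0}
    counts = [1] + [0] * d
    queue = deque([x])
    while queue:
        v = queue.popleft()
        if dist[v] == d:          # don't expand beyond distance d
            continue
        for w in graph.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                counts[dist[w]] += 1
                queue.append(w)
    return counts

def bottleneck(graph, x, d):
    """b_d(x) = min over 1 <= j <= d of |N_j(x)| / |N_{j-1}(x)|."""
    c = supporter_counts(graph, x, d)
    ratios = [c[j] / c[j - 1] for j in range(1, d + 1) if c[j - 1] > 0]
    return min(ratios) if ratios else 0.0

g = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"],
     "d": ["b", "c", "e"], "e": ["d"]}
counts = supporter_counts(g, "a", 3)
b3 = bottleneck(g, "a", 3)
```

A small b_d(x) means growth in supporters chokes off at some distance, which is what a tightly knit spam farm looks like from the inside.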
Content-based Features
Most interesting features presented:
Using the k most frequent words in the dataset, excluding stopwords:
- Corpus precision: the fraction of words in a page that appear in the set of popular terms
- Corpus recall: the fraction of popular terms that appear in the page
Using the set of q most popular terms in a query log:
- Query precision and query recall: analogous to corpus precision and recall
Used k and q = 100, 200, 500, and 1000
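The precision/recall features above reduce to set overlaps. A minimal sketch, assuming distinct words per page; the page text and popular-term set are toy examples, not drawn from the actual corpus or query log:

```python
# Corpus (or query) precision and recall as set-overlap fractions.
def precision_recall(page_words, popular_terms):
    page = set(page_words)
    popular = set(popular_terms)
    overlap = len(page & popular)
    precision = overlap / len(page)     # page words that are popular terms
    recall = overlap / len(popular)     # popular terms the page contains
    return precision, recall

p, r = precision_recall(
    ["cheap", "pills", "buy", "now"],
    ["cheap", "pills", "loan", "casino", "buy"])
```

A keyword-stuffed spam page tends to score unusually high on precision, since most of its words are deliberately drawn from the popular-term list.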
Content-based Features
- The best features are corpus precision and query precision
- All features were judged based only on histograms
- Histogram of the query precision in non-spam vs. spam pages for q = 500
Classifiers
- Cost-sensitive decision tree
- Cost of zero for correctly classifying an instance
- Misclassifying spam as normal is R times more costly than classifying a normal host as spam
- R can be used to tune the balance between the true-positive rate and the false-positive rate
- Used “bagging” to help reduce the false-positive rate
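The effect of the cost ratio R can be seen in the decision rule alone, independent of the tree learner. A library-free sketch (the paper used a cost-sensitive decision tree; here only the final cost-weighted thresholding is shown, with illustrative probabilities):

```python
# Cost-sensitive decision rule: label spam when the expected cost of calling
# the host normal (R * p_spam) exceeds the cost of calling it spam
# (1 * (1 - p_spam)). p_spam is the classifier's estimated spam probability.
def classify(p_spam, R):
    return "spam" if R * p_spam > (1 - p_spam) else "normal"

# With R = 1, a host at p_spam = 0.4 is labeled normal; raising R to 3 flips
# the label, trading a higher true-positive rate for more false positives.
```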
Conclusion
Experimental evidence led to these hypotheses:
- Non-spam nodes tend to be linked by very few spam nodes, and usually link to no spam nodes
- Spam nodes are mainly linked by spam nodes
These tendencies can be exploited to yield better spam detection:
- Using multiple features, both link-based and content-based, provided better detection
- The error rate can be tuned by adjusting the cost matrix
Critique
- The article presented many features, both link-based and content-based, that can be used for spam detection, as well as techniques to optimize detection based on the graph topology (smoothing)
- The results showed which features and optimizations were effective
- The dataset used is outdated, so there is no indication of how well the methods would work against newer or more sophisticated spamming techniques
- There was no direct comparison between prior research results and the results obtained