Know your Neighbors: Web Spam Detection using the Web Topology
Carlos Castillo, chato@yahoo-inc.com
Debora Donato, debora@yahoo-inc.com
Aristides Gionis, gionis@yahoo-inc.com
Vanessa Murdock, vmurdock@yahoo-inc.com
Fabrizio Silvestri, f.silvestri@isti.cnr.it
Presented by Anton Rodriguez-Dmitriev
Personal Background
- Graduated from FSU
- Working on an MSECE, specializing in Controls
- CS minor
- Works part-time at STW Technic, LP
Web Spam Consequences
- Damages the reputation of the search engine
- Weakens the trust of its users
- Eiron et al. ranked 100 million pages using PageRank: 11 of the top 20 were pornographic pages
- PageRank alone cannot filter spam
- Costs are incurred in crawling, indexing, and storing spam pages
Some Popular Spamming Techniques
- Link spam: creating a link structure, usually a tightly knit community of links, to try to affect the outcome of a link-based ranking algorithm
- Content spam: maliciously crafting the content of a Web page using techniques such as keyword stuffing, i.e., inserting keywords that are related to popular queries
- Cloaking: sending different content to a search engine than to the regular visitors of a website
Topology of the Dataset
- Used the WEBSPAM-UK2006 dataset: a publicly available spam collection
- Undirected graph, pruned to contain only hosts that share more than 100 links
- Black nodes are spam; white nodes are non-spam
- Most spammers in the largest connected component are clustered together
- Other connected components are single-class
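The pruning step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the input format (triples of host pairs with a shared-link count) and all names are assumptions.

```python
# Hypothetical sketch of the host-graph pruning: keep an undirected edge only
# when the two hosts share more than `min_links` hyperlinks between them.
from collections import defaultdict

def build_pruned_graph(link_counts, min_links=100):
    """link_counts: iterable of (host_a, host_b, shared_link_count)."""
    graph = defaultdict(set)
    for a, b, count in link_counts:
        if count > min_links and a != b:
            graph[a].add(b)  # undirected: store the edge in both directions
            graph[b].add(a)
    return graph

edges = [("spam1.example", "spam2.example", 250),
         ("spam1.example", "news.example", 3)]
g = build_pruned_graph(edges)
# Only the heavily cross-linked pair survives the pruning.
```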
Evaluation of the Process
Confusion matrix:

                    Predicted non-spam   Predicted spam
  Actual non-spam           a                  b
  Actual spam               c                  d

- a: the number of non-spam examples that were correctly classified
- b: the number of non-spam examples that were falsely classified as spam
- c: the number of spam examples that were falsely classified as non-spam
- d: the number of spam examples that were correctly classified
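The standard metrics follow directly from the four counts a, b, c, d defined above (treating spam as the positive class). A minimal sketch; the example numbers are illustrative:

```python
# Evaluation metrics derived from the confusion matrix counts a, b, c, d.
def metrics(a, b, c, d):
    recall = d / (c + d)               # true-positive rate: spam caught
    false_positive_rate = b / (a + b)  # non-spam wrongly flagged as spam
    precision = d / (b + d)            # flagged hosts that really are spam
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, false_positive_rate, precision, f_measure

# Toy counts for illustration: 1000 non-spam hosts, 200 spam hosts.
r, fpr, p, f = metrics(a=900, b=100, c=50, d=150)
```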
Link-based Features
Degree-related measures:
- In-degree and out-degree of the hosts and their neighbors
- Edge reciprocity: the number of links that are reciprocal
- Assortativity: the ratio between the degree of a particular page and the average degree of its neighbors
- PageRank
- TrustRank: uses a subset of hand-picked trusted nodes and propagates their labels through the Web graph
- Truncated PageRank: a variant of PageRank that diminishes the influence of a page on the PageRank of its neighbors
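For reference, the base algorithm that TrustRank and Truncated PageRank modify can be sketched as a plain power iteration. This is the generic textbook PageRank, not the paper's implementation; the graph and parameter values are illustrative.

```python
# Power-iteration PageRank on an adjacency-list graph (out_links maps each
# node to the list of nodes it links to).
def pagerank(out_links, damping=0.85, iters=50):
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Teleport term: every node gets the undamped share.
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, targets in out_links.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new[t] += share
            else:  # dangling node: spread its rank uniformly
                for t in nodes:
                    new[t] += damping * rank[v] / n
        rank = new
    return rank

g = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
pr = pagerank(g)
```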
Link-based Features
Estimation of supporters:
- Given two nodes x and y, x is a d-supporter of y if the shortest path from x to y has length d
- N_d(x) is the set of d-supporters of page x
- Spam pages have a smaller bottleneck than non-spam pages
- Bottleneck number: b_d(x) = min over 1 <= j <= d of { |N_j(x)| / |N_{j-1}(x)| }
- Histogram of b_4(x) for spam and non-spam pages
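The supporter counts |N_j(x)| and the bottleneck number above can be computed with a depth-limited BFS. A sketch under simplifying assumptions: the graph is given as an undirected adjacency list (the paper works on a directed Web graph and uses probabilistic estimation at scale), and the example graph is a toy.

```python
# BFS-based supporter counts and bottleneck number b_d(x).
from collections import deque

def supporter_counts(graph, x, d):
    """Return [|N_0(x)|, ..., |N_d(x)|], where N_j(x) is the set of nodes
    at shortest-path distance exactly j from x (N_0(x) = {x})."""
    dist = {x: 0}
    counts = [1] + [0] * d
    queue = deque([x])
    while queue:
        v = queue.popleft()
        if dist[v] == d:          # don't expand beyond distance d
            continue
        for w in graph.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                counts[dist[w]] += 1
                queue.append(w)
    return counts

def bottleneck(graph, x, d):
    """b_d(x) = min over 1 <= j <= d of |N_j(x)| / |N_{j-1}(x)|."""
    c = supporter_counts(graph, x, d)
    ratios = [c[j] / c[j - 1] for j in range(1, d + 1) if c[j - 1] > 0]
    return min(ratios) if ratios else 0.0

g = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"],
     "d": ["b", "c", "e"], "e": ["d"]}
counts = supporter_counts(g, "a", 3)
b3 = bottleneck(g, "a", 3)
```

A small b_d(x) means growth in supporters chokes off at some distance, which is what a tightly knit spam farm looks like from the inside.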
Content-based Features
Most interesting features presented:
Using the k most frequent words in the dataset, excluding stopwords:
- Corpus precision: the fraction of words in a page that appear in the set of popular terms
- Corpus recall: the fraction of popular terms that appear in the page
Using the set of q most popular terms in a query log:
- Query precision and query recall: analogous to corpus precision and recall
Used k and q = 100, 200, 500, and 1000
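The precision/recall features above reduce to set overlaps. A minimal sketch, assuming distinct words per page; the page text and popular-term set are toy examples, not drawn from the actual corpus or query log:

```python
# Corpus (or query) precision and recall as set-overlap fractions.
def precision_recall(page_words, popular_terms):
    page = set(page_words)
    popular = set(popular_terms)
    overlap = len(page & popular)
    precision = overlap / len(page)     # page words that are popular terms
    recall = overlap / len(popular)     # popular terms the page contains
    return precision, recall

p, r = precision_recall(
    ["cheap", "pills", "buy", "now"],
    ["cheap", "pills", "loan", "casino", "buy"])
```

A keyword-stuffed spam page tends to score unusually high on precision, since most of its words are deliberately drawn from the popular-term list.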
Content-based Features
- The best features are corpus precision and query precision
- All features were judged based only on histograms
- Histogram of the query precision in non-spam vs. spam pages for q = 500
Classifiers
- Cost-sensitive decision tree
- Cost of zero for correctly classifying an instance
- Misclassifying spam as normal is R times more costly than classifying a normal host as spam
- R can be used to tune the balance between the true-positive rate and the false-positive rate
- Used “bagging” to help reduce the false-positive rate
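The effect of the cost ratio R can be seen in the decision rule alone, independent of the tree learner. A library-free sketch (the paper used a cost-sensitive decision tree; here only the final cost-weighted thresholding is shown, with illustrative probabilities):

```python
# Cost-sensitive decision rule: label spam when the expected cost of calling
# the host normal (R * p_spam) exceeds the cost of calling it spam
# (1 * (1 - p_spam)). p_spam is the classifier's estimated spam probability.
def classify(p_spam, R):
    return "spam" if R * p_spam > (1 - p_spam) else "normal"

# With R = 1, a host at p_spam = 0.4 is labeled normal; raising R to 3 flips
# the label, trading a higher true-positive rate for more false positives.
```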
Conclusion
Experimental evidence led to these hypotheses:
- Non-spam nodes tend to be linked by very few spam nodes, and usually link to no spam nodes
- Spam nodes are mainly linked by spam nodes
These tendencies can be exploited to yield better spam detection:
- Using multiple features, both link-based and content-based, provided better detection
- The error rate can be tuned by adjusting the cost matrix
Critique
- The article presented many features, both link-based and content-based, that can be used for spam detection, as well as techniques to optimize detection based on the graph topology (smoothing)
- The results showed which features and optimizations were effective
- The dataset used is outdated, so there is no indication of how well the methods would work against newer or more sophisticated spamming techniques
- There was no direct comparison between prior research results and the results obtained