Presentation of PolaritySpam, a graph-based ranking algorithm intended to demote spam web pages in the ranking provided by a web search engine. Cite as: F. Javier Ortega; Craig Macdonald; José A. Troyano; Fermín L. Cruz. “Spam Detection with a Content-based Random-Walk Algorithm”. Proceedings of the Second International Workshop on Search and Mining User-Generated Contents, International Conference on Information and Knowledge Management. 2010. Toronto, Canada.
Spam detection with a content-based random-walk algorithm
F. Javier Ortega [email protected]
José A. Troyano [email protected]
Craig Macdonald [email protected]
Fermín L. Cruz [email protected]
Index
♦ Introduction
♦ Related work
  ♦ Content-based
  ♦ Link-based
♦ Our Approach
  ♦ Random-walk algorithm
  ♦ Content-based metrics
  ♦ Selection of seeds
♦ Experiments
♦ Future work
♦ References
Introduction
♦ Web Spam: phenomenon where a number of web pages are created for the purpose of making a search engine deliver undesirable results for a given query.
Introduction
♦ Self-Promotion: gaining high relevance for a search engine, mainly based on the textual content.
e.g.: including a large number of keywords in the web page.
Introduction
♦ Mutual-Promotion: gaining a high score by focusing the attention on the out-links and in-links of a web page.
e.g.: a web page with lots of in-links can be considered relevant by a search engine.
Introduction
♦ Web Spam characteristics:
♦ Textual content: large amounts of invisible content, a few words repeated with very high frequency, lots of hyperlinks with long anchor texts, very long words, etc.
♦ Link-farms: large numbers of pages pointing to one another, in order to improve their scores by increasing the number of in-links to them.
♦ Good pages usually point to good pages.
♦ Spam pages mainly point to other spam pages (link-farms). They rarely point to good pages.
Related work: Content-based
♦ Content-based techniques classify the web pages as spam or not-spam according to their textual content.
♦ Heuristics to determine the spam likelihood of a web page.
♦ Meta tag content, anchor texts, URL of the page, average length of the words, compression rate, etc. [10, 12]
♦ Inclusion of link-based scores and metrics into a classifier [3]
♦ Link-based techniques exploit the relations between web pages to obtain a rank of pages, ordered according to their spam likelihood.
♦ Random-walk algorithms that penalize spam-like behaviors.
♦ Some do not take the nearest neighbours into account [1].
♦ Others take only the scores received from a specific set of good or bad pages [7, 11].
Our Approach
♦ Our approach combines both techniques:
♦ A set of content-based metrics that obtains information from each single web page.
♦ A link-based algorithm that processes the relations between web pages.
♦ The goal is to obtain a ranking of web pages, in which spam web pages are demoted according to their spam likelihood.
Our Approach
[Diagram: Web pages → Content-based metrics → Selection of seeds → Random-walk algorithm, over the Web graph]
Our Approach: random-walk algorithm
♦ We propose a random-walk algorithm that computes two scores for each web page:
♦ PR⁺: relevance of a web page.
♦ PR⁻: spam likelihood of a web page.
♦ PR⁻(b) changes according to the relation of b with spam-like web pages. Analogous for PR⁺(b).
♦ For a link a → b: the higher PR⁺(a), the higher PR⁺(b); the higher PR⁻(a), the higher PR⁻(b).
Our Approach: random-walk algorithm
♦ Formula: a PageRank-like update that propagates PR⁺ and PR⁻ along the links of the graph.
♦ Intuition: a page pointed to by pages with a high PR⁺ obtains a higher PR⁺; a page pointed to by pages with a high PR⁻ obtains a higher PR⁻.
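As a rough illustration of this two-score random walk, the sketch below runs two PageRank-style propagations over the same graph from different a-priori vectors (one for relevance, one for spam likelihood). Function and variable names are assumptions for illustration; the exact update rule is the formula in the paper.

```python
# Hypothetical sketch: two PageRank-style scores propagated over the same graph.
# `graph` maps each node to its list of out-links; `pos_prior`/`neg_prior` are
# a-priori relevance/spam scores (assumed inputs, e.g. from content metrics).
def two_score_walk(graph, pos_prior, neg_prior, damping=0.85, threshold=0.01):
    nodes = list(graph)
    n = len(nodes)
    pr_pos = {v: pos_prior.get(v, 1.0 / n) for v in nodes}
    pr_neg = {v: neg_prior.get(v, 1.0 / n) for v in nodes}
    while True:
        new_pos, new_neg, delta = {}, {}, 0.0
        for v in nodes:
            # Scores flow along in-links, split over each pointing page's out-degree.
            in_pos = sum(pr_pos[u] / len(graph[u]) for u in nodes if v in graph[u])
            in_neg = sum(pr_neg[u] / len(graph[u]) for u in nodes if v in graph[u])
            new_pos[v] = (1 - damping) * pos_prior.get(v, 1.0 / n) + damping * in_pos
            new_neg[v] = (1 - damping) * neg_prior.get(v, 1.0 / n) + damping * in_neg
            delta += abs(new_pos[v] - pr_pos[v]) + abs(new_neg[v] - pr_neg[v])
        pr_pos, pr_neg = new_pos, new_neg
        if delta < threshold:  # stop once both score vectors are stable
            return pr_pos, pr_neg
```

A page linked from high-PR⁻ pages accumulates PR⁻ in exactly the same way that a page linked from high-PR⁺ pages accumulates PR⁺.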
Our Approach: content-based metrics
♦ Content-based metrics are intended to extract some a-priori information from the textual content of the web pages.
♦ Content-based metrics must be:
♦ Easy to obtain: performance matters!
♦ Accurate: precision is preferred over recall.
Our Approach: content-based metrics
♦ Selected metrics:
♦ Compressibility: ratio between the sizes of a web page before and after being compressed.
♦ Fraction of globally popular words: a web page with a high fraction of its words among the most popular words in the entire corpus is likely to be spam.
♦ Average length of words: non-spam web pages have a bell-shaped distribution of average word lengths, while malicious pages have much higher values.
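These three metrics are cheap to compute from the raw text alone. A minimal sketch (the function name, whitespace tokenization, and the popular-word lookup are assumptions, not the paper's exact procedure):

```python
import zlib

# Illustrative versions of the three metrics; a real system would tokenize
# HTML properly and build the popular-word set over the whole corpus.
def content_metrics(text, popular_words):
    words = text.split()
    raw = text.encode("utf-8")
    # Compressibility: ratio of the page size before vs. after compression.
    compressibility = len(raw) / max(1, len(zlib.compress(raw)))
    # Fraction of the page's words that belong to the globally popular set.
    popular_fraction = sum(1 for w in words if w.lower() in popular_words) / max(1, len(words))
    # Average word length; spam pages tend to show unusually high values.
    avg_word_length = sum(len(w) for w in words) / max(1, len(words))
    return compressibility, popular_fraction, avg_word_length
```

A keyword-stuffed page is highly repetitive, so it compresses much better (a higher ratio) than ordinary prose.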
Our Approach: selection of seeds
♦ Seeds: set of relevant nodes, in terms of spam likelihood (negative seeds) or not-spam likelihood (positive seeds).
♦ The algorithm gives more relevance to the seeds.
♦ Spam-biased algorithm
Our Approach: selection of seeds
♦ Unsupervised method: content-based metrics as features to choose the seeds.
♦ Pros:
♦ Human intervention is not needed.
♦ A larger number of seeds can be considered.
♦ Inclusion of text content into a link-based method.
♦ Due to the lack of human intervention... “false positives”.
Our Approach: selection of seeds
♦ Obtaining an a-priori score for a node a (from its content-based metrics).
♦ Selecting seeds:
♦ Pos/Neg Approach
♦ Pos/Neg Metrics Approach
♦ Metric-based Approach
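As a hedged illustration of the idea behind these approaches (the actual scoring formulas are in the paper; this simple top/bottom split is an assumption for illustration only): given an a-priori spam score per node, take the nodes with the lowest scores as positive seeds and those with the highest as negative seeds.

```python
# Hypothetical seed selection: `apriori` maps each node to a spam-likelihood
# score derived from the content-based metrics (higher = more spam-like).
def select_seeds(apriori, n_seeds):
    ranked = sorted(apriori, key=apriori.get)   # least spam-like first
    positive = set(ranked[:n_seeds])            # most trustworthy nodes
    negative = set(ranked[-n_seeds:])           # most spam-like nodes
    return positive, negative
```

Because no human checks the seeds, a page with odd but legitimate content can end up as a negative seed, which is the “false positives” caveat above.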
Experiments
♦ Dataset: WEBSPAM-UK2006*
♦ ~98 million pages
♦ 11,402 hand-labeled hosts
♦ 7,423 labeled as spam
♦ ~10 million spam web pages
♦ Terrier IR Platform
♦ Random-walk algorithm parameters:
♦ Damping factor = 0.85
♦ Threshold = 0.01

* C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini, and S. Vigna. A reference collection for web spam. SIGIR Forum, 40(2):11–24, December 2006.
Experiments
♦ Evaluation: PR-buckets. Pages are sorted by PageRank and grouped into buckets of roughly equal total PageRank, so the first buckets contain the few most relevant pages.

Bucket   Total Pages
1        14
2        54
3        144
4        437
5        1070
6        2130
7        2664
8        2778
...      ...
17       16M
18       28M
19       28M
20       28M
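The bucketing above can be sketched as follows, assuming (as the growing bucket sizes suggest) that each bucket collects roughly the same total PageRank mass; the function and names are illustrative:

```python
# Group pages into buckets of roughly equal total PageRank, highest scores first,
# so early buckets hold a handful of very relevant pages and late buckets millions.
def pr_buckets(pagerank, n_buckets=20):
    items = sorted(pagerank.items(), key=lambda kv: kv[1], reverse=True)
    target = sum(pagerank.values()) / n_buckets  # PR mass per bucket
    buckets, current, mass = [], [], 0.0
    for page, score in items:
        current.append(page)
        mass += score
        if mass >= target and len(buckets) < n_buckets - 1:
            buckets.append(current)
            current, mass = [], 0.0
    buckets.append(current)  # remaining pages form the last bucket
    return buckets
```

Counting how many known spam hosts fall into each bucket then shows whether an algorithm demotes spam out of the top (most relevant) buckets.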
Experiments
♦ Baseline: TrustRank [7]
♦ Link-based technique.
♦ Seeds chosen in a semi-supervised way:
• Hand-picked set of good pages.
• Top pages according to an inverse PageRank.
♦ Random-walk algorithm, biased according to the seeds.
Experiments
[Figure: number of spam hosts per PR-bucket, log scale, comparing TrustRank with the Pos/Neg, Pos/Neg Metrics and Metric-based approaches]
Conclusions and future work
♦ Novel web spam detection technique that combines concepts from link-based and content-based methods.
♦ Content-based metrics as an unsupervised seed selection method.
♦ Random-walk algorithm to compute two scores for each web page: spam and not-spam likelihood.
♦ Future work:
♦ Including new content-based heuristics.
♦ Improving the spam-biased selection of the seeds, taking into account the links to/from each node.
♦ Content-based metrics to also characterize the edges of the web graph.
References
[1] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based characterization and detection of web spam. In AIRWeb ’06: Adversarial Information Retrieval on the Web, 2006.
[2] A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. SpamRank – fully automatic link spam detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.
[3] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: web spam detection using the web topology. In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 423–430, New York, NY, USA, 2007. ACM.
[4] G. V. Cormack, M. D. Smucker, and C. L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Computing Research Repository, 2010.
[5] L. da F. Costa, F. A. Rodrigues, G. Travieso, and P. R. V. Boas. Characterization of complex networks: A survey of measurements. Advances in Physics, 56(1):167–242, January 2005.
[6] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In WebDB ’04: Proceedings of the 7th International Workshop on the Web and Databases, pages 1–6, New York, NY, USA, 2004. ACM.
[7] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with trustrank. Technical Report 2004-17, Stanford InfoLab, March 2004.
[8] T. H. Haveliwala. Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search. Technical Report 2003-29, 2003.
[9] G. Jeh and J. Widom. Simrank: a measure of structural-context similarity. In KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538–543, New York, NY, USA, 2002. ACM.
[10] P. Kolari, T. Finin, and A. Joshi. Svms for the blogosphere: Blog identification and splog detection. In AAAI Spring Symposium on Computational Approaches to Analysing Weblogs. Computer Science and Electrical Engineering, University of Maryland, Baltimore County, March 2006.
[11] V. Krishnan. Web spam detection with anti-trustrank. In ACM SIGIR workshop on Adversarial Information Retrieval on the Web, Seattle, Washington, USA, 2006.
[12] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In WWW ’06: Proceedings of the 15th international conference on World Wide Web, pages 83–92, New York, NY, USA, 2006. ACM.
[13] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web, 1999.
[14] B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote web spam. In Proceedings of Models of Trust for the Web (MTW), a workshop at the 15th International World Wide Web Conference, Edinburgh, Scotland, 2006.
Thanks for your attention!!
Questions?
F. Javier Ortega [email protected]
José A. Troyano [email protected]
Craig Macdonald [email protected]
Fermín L. Cruz [email protected]