The Efficacy of Collusions in Web Ranking
and the Countermeasurements
Hui Zhang
University of Southern California
04/22/23 USC CS599 2
Outline
• Problem statement.
• PageRank algorithm: a brief introduction.
• Study of PageRank’s robustness to collusion.
• Adaptive-resetting: making PageRank robust to collusion.
• Conclusions.
Search Engine Optimization (SEO)
Web spam [Gyöngyi et al. 2004]
• Web spamming refers to actions intended to mislead search engines into ranking some pages higher than they deserve.
• A spammer manipulates the two factors that determine a page’s rank score for a query:
  Relevance – textual similarity between the query and the page.
  Importance – the global popularity of the page, which is query-independent.
Collusion in Web ranking
• A manipulation of the hyperlink structure by a group of users with the intention of improving the rating of one or more users in the group.
PageRank [Brin1998]
• An eigenvector-based rating scheme to rank hypertext documents on the WWW.
• An iterative algorithm that calculates the importance of a web page from the importance of its parent pages.
• Can be applied to systems other than the WWW.
PageRank: random walk model
[Figure: a walker on a graph of nodes X, Y, Z joined by referential links; link labels 1/2 and 1/3]
The walker:
  With prob. (1-ε), I will continue the walk to a random successor node.
  With prob. ε, I will restart the walk at a random node.
  ε: resetting probability
• As time goes on, the expected fraction of steps the walker spends at each node v converges to the PageRank weight PR(v).
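The random-walk model can be checked with a small simulation. This is a minimal sketch rather than the author's code: the node labels X, Y, Z come from the figure, but the edges of the toy graph, the step count, and the name `simulate_walk` are assumptions made for illustration.

```python
import random

def simulate_walk(out_links, eps=0.15, steps=200_000, seed=0):
    """Estimate PR(v) as the fraction of steps a random walker spends at v.

    out_links maps each node to the list of its successor nodes.
    eps is the resetting probability.
    """
    rng = random.Random(seed)
    nodes = list(out_links)
    visits = {v: 0 for v in nodes}
    current = rng.choice(nodes)
    for _ in range(steps):
        visits[current] += 1
        successors = out_links[current]
        # Assumption: a node without successors also forces a restart.
        if rng.random() < eps or not successors:
            current = rng.choice(nodes)        # restart at a random node
        else:
            current = rng.choice(successors)   # continue to a random successor
    return {v: count / steps for v, count in visits.items()}

# Hypothetical edges for the three labeled nodes.
toy = {"X": ["Y", "Z"], "Y": ["Z"], "Z": ["X"]}
freq = simulate_walk(toy)
for node, share in sorted(freq.items()):
    print(node, round(share, 3))
```

Increasing `steps` brings the visit frequencies closer to the stationary PageRank weights.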
PageRank: is it collusion-proof?
• Can a node easily boost its rank by coordinating its outgoing links with those of other nodes?
[Figure: four nodes, each saying “I’m not colluding!”]
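Amp(G): a metric on group collusion
• In the system of node group G, for a subgroup G’, the amplification factor is

$$\mathrm{Amp}(G') = \frac{W_G(G')}{W_{in}(G')}$$

$$W_G(G') = \sum_{i:\, i \in G'} PR(i)$$

$$W_{in}(G') = (1-\epsilon) \sum_{(i,j):\, i \notin G',\, j \in G'} \frac{PR(i)}{out(i)} + \epsilon\,\frac{|G'|}{|G|}\,\bigl(1 - W_G(G')\bigr)$$

W_G(G’): real group weight
W_in(G’): “actual” group weight
ε: resetting probability
[Figure: group G of N nodes with subgroup G’ = {i, j}; outside node x (3 out-links) and outside node y (4 out-links) each have one link into G’]
W_G(G’) = PR(i) + PR(j)
W_in(G’) = (1-ε) PR(x)/3 + (1-ε) PR(y)/4 + ε (2/N) (1 - W(G’))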
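The metric can be evaluated on a toy graph. The sketch below is an illustration under stated assumptions, not the author's implementation: it reads W_in(G’) as the (1-ε)-weighted PR flowing over links from outside G’ into G’, plus the reset mass ε·|G’|/|G| arriving from the weight outside G’; the six-node graph, the helper names, and the requirement that every node has at least one out-link are all hypothetical choices.

```python
def pagerank(out_links, eps=0.15, iters=200):
    """Power iteration; assumes every node has at least one out-link."""
    nodes = list(out_links)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: eps / n for v in nodes}
        for u in nodes:
            for v in out_links[u]:
                nxt[v] += (1.0 - eps) * pr[u] / len(out_links[u])
        pr = nxt
    return pr

def amplification(out_links, group, eps=0.15):
    """Amp(G') = W_G(G') / W_in(G') under the reading described above."""
    pr = pagerank(out_links, eps)
    n = len(out_links)
    w_group = sum(pr[i] for i in group)
    # PR carried by links that start outside the group and end inside it.
    link_inflow = sum(
        pr[i] / len(out_links[i])
        for i in out_links
        if i not in group
        for j in out_links[i]
        if j in group
    )
    w_in = (1.0 - eps) * link_inflow + eps * len(group) / n * (1.0 - w_group)
    return w_group / w_in

# Hypothetical honest graph: node i links to i+1 and i+2 (mod 6).
honest = {i: [(i + 1) % 6, (i + 2) % 6] for i in range(6)}
# Nodes 0 and 1 collude: each keeps only a link to the other.
colluding = dict(honest)
colluding[0], colluding[1] = [1], [0]

group = {0, 1}
amp_before = amplification(honest, group)
amp_after = amplification(colluding, group)
print(round(amp_before, 3), round(amp_after, 3))
```

Pointing all of the pair's out-links at each other keeps the walker inside the group until a reset, which is what drives the ratio up in this toy case.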
Answer for (1+1 = ?) in PageRank
• In the original PageRank system,

$$\forall G' \subseteq G:\ \mathrm{Amp}(G') \le \frac{2}{\epsilon}$$

Specifically, when $|G'|/|G| \ll 1$,

$$\mathrm{Amp}(G') \le \frac{1}{\epsilon}$$

where ε is the resetting probability.
Two experimental topologies
• W, a Web link topology
  Contains the link structure of upwards of 80 million URLs.
  Source: the Stanford WebBase.
• B, a weblog blogrolling topology
  Contains the blogrolling structure of upwards of 72,000 blogs.
  Source: www.blogstreet.com, the XML-RPC weblog service.
Experiment 1: Collusion200
• Model a small number of web pages colluding simultaneously.
• Methodology:
  • 100 colluding groups, 200 nodes in total;
  • Each colluding group is a two-node circle formed by nodes with adjacent ranks;
  • Node pairs were chosen arbitrarily from those originally ranked around 1000th, 2000th, …, 100000th;
  • ε = 0.15.
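One colluding pair from this methodology can be mimicked on a synthetic graph. The sketch below is only an illustration: the 200-node random graph with 5 out-links per node, the seed, and the choice of the 100th and 101st ranked nodes are assumptions standing in for the W topology, which is not reproduced here.

```python
import random

def pagerank(out_links, eps=0.15, iters=100):
    """Power iteration over a list-of-lists graph (node ids 0..n-1)."""
    n = len(out_links)
    pr = [1.0 / n] * n
    for _ in range(iters):
        nxt = [eps / n] * n
        for u, targets in enumerate(out_links):
            share = (1.0 - eps) * pr[u] / len(targets)
            for v in targets:
                nxt[v] += share
        pr = nxt
    return pr

def ranks(pr):
    """Map node -> rank position (1 = highest PR weight)."""
    order = sorted(range(len(pr)), key=lambda v: -pr[v])
    return {v: pos + 1 for pos, v in enumerate(order)}

# Hypothetical topology: every node links to 5 random other nodes.
rng = random.Random(0)
n = 200
graph = [rng.sample([v for v in range(n) if v != u], 5) for u in range(n)]

before = ranks(pagerank(graph))
by_rank = {pos: v for v, pos in before.items()}
a, b = by_rank[100], by_rank[101]          # two nodes with adjacent ranks

# Collusion: the pair forms a two-node circle, dropping all other out-links.
colluded = [list(targets) for targets in graph]
colluded[a], colluded[b] = [b], [a]
after = ranks(pagerank(colluded))

print("node a:", before[a], "->", after[a])
print("node b:", before[b], "->", after[b])
```

Repeating the rewiring for many pairs at different starting ranks would give a small-scale analogue of the 100 groups described above.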
Experiment result of Collusion200 (I)
Figure 1: W - Amplification factors of the 100 colluding groups in Collusion200.
Experiment result of Collusion200 (III)
Figure 2: W – new PR rank after Collusion200.
Old rank: 1005th → new rank: 67th
Old rank: 10001st → new rank: 450th
Old rank: 100009th → new rank: 5038th
There is a long flat portion…
Figure 3: The PR weight distribution of 4 topologies.
Next step: how to detect collusions?
• Identifying colluding groups is unlikely to be computationally tractable.
  • The densest k-subgraph problem [Feige et al. 1997].
  • The classical CLIQUE problem.
  • The problem of finding large hidden cliques in random graphs [Juels 1998].
Hardness on Amp
• Theorem on hardness: computing $\max_{G' \subseteq G} \mathrm{Amp}(G')$ is an NP-hard problem.
How about using finer statistics of the random walk?
• The revisit intervals of the random walk on a colluding node are likely to have a large variance compared to their expectation.
• A counterexample: a star + dangling circle topology.
Figure E: star + dangling circle topology (nodes labeled 0, 1, 2, N-2, N-1, N, N+1).
An observation on collusion behaviors
• To increase their PR weight, i.e., their stationary weight in the random walk, the colluding nodes will stall the random walk.
• When the resetting probability ε increases, the colluding nodes must suffer a significant drop in PR weight.
• Therefore, we expect the PR weight of colluding nodes to be highly correlated with 1/ε (the average walk length), while that of non-colluding nodes is relatively insensitive to changes in ε.
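An intuitive example
[Figure: nodes joined by referential links, with a colluding group highlighted; x is a colluding node, y a non-colluding node]
• A colluding node x: PR(x) = [formula in N and K; garbled in extraction], and co-co(PR(x), 1/ε) ≈ 1. (co-co: correlation coefficient)
• A non-colluding node y: PR(y) = [formula in N and K; garbled in extraction], and co-co(PR(y), 1/ε) ≈ 0.
N: the system size; K: the colluding group size; K << N.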
Adaptive-resetting scheme
• Part I – collusion detection:
  Given the topology, calculate the PR vector under different ε values.
  {ε} = {0.0375, 0.05, 0.075, 0.15, 0.3, 0.45, 0.6}, ε_default = 0.15.
  Calculate the correlation coefficient between the curve of each node x's PR weight and the curve of 1/ε. Label it as co-co(x).
• Part II – personalization:
  Calculate each node x's out-link personalized ε = F(ε_default, co-co(x)).
  Exponential function: $F_{Exp} = \epsilon_{default}^{\,(1.0 - \text{co-co}(x))}$
  Linear function: $F_{Linear} = \epsilon_{default} + (0.5 - \epsilon_{default}) \cdot \text{co-co}(x)$
  The final PR weight vector is calculated with these personalized resetting values.
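The two parts of the scheme can be sketched end to end on a toy graph. This is a hedged illustration, not the author's implementation. The assumptions are: the exponential function is read as ε_default raised to the power (1.0 - co-co(x)); "out-link personalized" is read as each node resetting with its own probability whenever the walker is at that node; negative co-co values are clamped to 0 so the personalized value never falls below the default; and the six-node graph and all function names are hypothetical.

```python
import math

EPS_VALUES = [0.0375, 0.05, 0.075, 0.15, 0.3, 0.45, 0.6]  # set given on the slide
EPS_DEFAULT = 0.15

def pagerank(out_links, eps_of, iters=500):
    """PageRank where node u resets with its own probability eps_of[u]."""
    nodes = list(out_links)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        reset_mass = sum(eps_of[u] * pr[u] for u in nodes)
        nxt = {v: reset_mass / n for v in nodes}
        for u in nodes:
            share = (1.0 - eps_of[u]) * pr[u] / len(out_links[u])
            for v in out_links[u]:
                nxt[v] += share
        pr = nxt
    return pr

def pearson(xs, ys):
    """Plain Pearson correlation coefficient; 0.0 if either series is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def adaptive_pagerank(out_links, kind="linear"):
    nodes = list(out_links)
    # Part I: PR under each eps, then co-co(x) against the curve of 1/eps.
    runs = [pagerank(out_links, {v: e for v in nodes}) for e in EPS_VALUES]
    inv_eps = [1.0 / e for e in EPS_VALUES]
    coco = {v: pearson([run[v] for run in runs], inv_eps) for v in nodes}
    # Part II: personalized resetting value per node.
    eps_of = {}
    for v in nodes:
        c = max(0.0, coco[v])  # assumption: clamp negative correlations to 0
        if kind == "linear":
            eps_of[v] = EPS_DEFAULT + (0.5 - EPS_DEFAULT) * c
        else:
            eps_of[v] = EPS_DEFAULT ** (1.0 - c)
    return pagerank(out_links, eps_of), coco, eps_of

# Hypothetical graph: node i links to i+1 and i+2 (mod 6); 0 and 1 collude.
graph = {i: [(i + 1) % 6, (i + 2) % 6] for i in range(6)}
graph[0], graph[1] = [1], [0]

default_pr = pagerank(graph, {v: EPS_DEFAULT for v in graph})
adaptive_pr, coco, eps_of = adaptive_pagerank(graph, kind="linear")
print("co-co:", {v: round(c, 2) for v, c in coco.items()})
print("pair weight, default :", round(default_pr[0] + default_pr[1], 3))
print("pair weight, adaptive:", round(adaptive_pr[0] + adaptive_pr[1], 3))
```

Passing `kind="exp"` switches to the exponential function; in both cases a node whose PR weight tracks 1/ε closely is given a larger resetting probability, which shortens the walks it can trap.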
Experiment result of Collusion200 (IV)
Figure 5: W - Amplification factors of the 100 colluding groups in Collusion200.
Experiment result of Collusion200 (VI)
Figure 6: W – new PR rank after Collusion200.
Experiment 2: Collusion22
• Model various colluding subgraphs.
• Methodology: 3 colluding groups:
  G1: 10-node ring
  G2: 10-node star topology
  G3: 2-node ring
[Figure: the three colluding groups drawn as nodes joined by referential links]
Experiment result of Collusion22 (I)
Figure 7: Amplification factors of the 3 colluding groups in Collusion22.
Experiment result of Collusion22 (II)
Figure 8: W – new PR weight after Collusion22.
New top-25 URL list in W
[Table legend: Dropped out, Dropping, New]
Conclusions
• Simple collusions can substantially improve a page’s Web ranking.
• A simple scheme based on the PageRank algorithm effectively counteracts Web ranking collusions.
Backup slides
Reputation systems [Okita2003]
• A means of describing social trust networks.
• The basic concept is a democratic meritocracy.
• A rating system is used to evaluate individual members, and those results are then collated to produce a consensus about the merit of any given member.
• Examples: Livejournal, Friendster, eBay, Advogato
PageRank algorithm [Brin1998]
• Assume N pages.
• Assign all pages the initial value 1/N.
• Let N_u be the out-degree of page u, Rank(v) the importance of page v, and B_v the set of pages pointing to v.
• Basic algorithm: for all v,

$$Rank(v) = \sum_{u \in B_v} \frac{Rank(u)}{N_u}$$

• Enhanced algorithm against rank sinks: for all v,

$$Rank(v) = (1-\epsilon) \sum_{u \in B_v} \frac{Rank(u)}{N_u} + \frac{\epsilon}{N}$$

ε: damping factor
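The enhanced iteration can be written in a few lines. The sketch below is a minimal illustration, assuming the update Rank(v) = (1-ε)·Σ Rank(u)/N_u + ε/N with ε as the reset term; the three-page graph, the fixed iteration count, and the uniform treatment of pages without out-links are assumptions that do not come from the slide.

```python
def pagerank(out_links, eps=0.15, iters=100):
    """Iterate Rank(v) = (1 - eps) * sum_{u in B_v} Rank(u) / N_u + eps / N.

    out_links maps every page to the list of pages it links to.
    """
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # initial value 1/N
    for _ in range(iters):
        new = {p: eps / n for p in pages}         # reset term eps/N
        for u in pages:
            # Assumption: a page with no out-links spreads its rank uniformly.
            targets = out_links[u] or pages
            share = (1.0 - eps) * rank[u] / len(targets)
            for v in targets:
                new[v] += share
        rank = new
    return rank

# Hypothetical three-page graph.
toy = {"X": ["Y", "Z"], "Y": ["Z"], "Z": ["X"]}
ranks = pagerank(toy)
for page, value in sorted(ranks.items()):
    print(page, round(value, 4))
```

Setting `eps=0` gives the basic algorithm, which can leak or trap rank on graphs with sinks; the reset term is what keeps the total rank at 1 and the iteration convergent.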
Co-co distribution in real-world graphs
Figure 4: the co-co PDF distribution in W and B; the [0, 0.1] range actually corresponds to the [-1, 0.1] range.
Experiment result of Collusion200 (II)
Figure A: W – new PR weight after Collusion200.
Experiment result of Collusion200 (VII)
Figure B: B – new PR rank after Collusion200.
Experiment result of Collusion200 (X)
Figure C: B – new PR weight after Collusion200.
Experiment result of Collusion200 (V)
Figure 6: W – new PR weight after Collusion200.
Correlation coefficient
Experiment result of Collusion22 (III)
Figure D: W – new PR rank after Collusion22.