The Efficacy of Collusions in Web Ranking and the Countermeasurements Hui Zhang University of Southern California

Page 1: The Efficacy of Collusions  in Web Ranking  and the Countermeasurements

The Efficacy of Collusions in Web Ranking and the Countermeasurements

Hui Zhang, University of Southern California

04/22/23 USC CS599 2

Outline

• Problem Statement.

• PageRank algorithm: a brief introduction.

• Study of PageRank’s robustness to collusion.

• Adaptive-resetting: make PageRank robust to collusion.

• Conclusions.

Search Engine Optimization (SEO)

Web spam [Gyöngyi et al. 2004]

• Web spamming refers to actions intended to mislead search engines and give some pages higher ranking than they deserve.

• A spammer manipulates the two factors that decide the rank score of a page in a query:

Relevance – textual similarity between the query and a page.

Importance – the global popularity of a page, which is query-independent.

Collusion in Web ranking

• A manipulation of the hyperlink structure by a group of users with the intention of improving the rating of one or more users in the group.

PageRank [Brin1998]

• An eigenvector-based rating scheme to rank hypertext documents on the WWW.

• An iterative algorithm to calculate the importance of a web page based on the importance of its parent pages.

• Can be applied to systems other than the WWW.

PageRank: random walk model

[Figure: a walker moving over nodes X, Y, and Z joined by referential links; edge labels 1/2 and 1/3.]

• With prob. (1-ε), I will continue the walk to a random successor node.

• With prob. ε, I will restart the walk at a random node.

ε: resetting probability

• As time goes on, the expected percentage of steps the walker is at each node v converges to the PageRank weight PR(v).
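The random walk described here can be turned into a short power iteration. The following is a minimal sketch, not the implementation used in the talk: the function name `pagerank`, the adjacency-list input, and the choice to let a node without successors always restart are assumptions made for illustration.

```python
import numpy as np

def pagerank(out_links, eps=0.15, tol=1e-12, max_iter=1000):
    """Stationary weights of the resetting random walk.

    out_links[u] lists the successors of node u (nodes are 0..N-1).
    eps is the resetting probability.
    """
    n = len(out_links)
    pr = np.full(n, 1.0 / n)          # start from the uniform distribution
    for _ in range(max_iter):
        new = np.zeros(n)
        reset_mass = 0.0
        for u, succs in enumerate(out_links):
            if succs:
                # With prob. (1 - eps): move to a random successor.
                share = (1.0 - eps) * pr[u] / len(succs)
                for v in succs:
                    new[v] += share
                # With prob. eps: restart at a random node.
                reset_mass += eps * pr[u]
            else:
                # Assumption: a node with no successors always restarts.
                reset_mass += pr[u]
        new += reset_mass / n
        if np.abs(new - pr).sum() < tol:
            return new
        pr = new
    return pr

# Hypothetical three-node graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}.
graph = [[1, 2], [2], [0]]
print(pagerank(graph))
```

The returned vector sums to 1, so each entry can be read as the long-run share of steps the walker spends at that node.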

PageRank: is it collusion-proof?

• Can a node easily boost its rank by manipulating its out-going links with others’?

[Figure: nodes, each labeled “I’m not colluding!”]

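Amp(G): a metric on group collusion

• In the system of node group G, for a subgroup G’, the amplification factor is

$$\mathrm{Amp}(G') = \frac{W_G(G')}{W_{in}(G')}$$

Real group weight:

$$W_G(G') = \sum_{i \in G'} PR(i)$$

“Actual” group weight:

$$W_{in}(G') = \sum_{(i,j):\, i \notin G',\, j \in G'} (1-\epsilon)\,\frac{PR(i)}{out(i)} + \epsilon\,\frac{|G'|}{|G|}\,\bigl(1 - W_G(G')\bigr)$$

ε: resetting probability

[Figure: worked example with G’ = {i, j} inside G and outside nodes x and y; edges into G’ are labeled PR(x)/3 and PR(y)/4.]

$$W_G(G') = PR(i) + PR(j)$$

$$W_{in}(G') = (1-\epsilon)\,\frac{PR(x)}{3} + (1-\epsilon)\,\frac{2\,PR(y)}{4} + \epsilon\,\frac{2}{N}\,\bigl(1 - W_G(G')\bigr)$$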
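As a rough illustration of the metric, the sketch below computes Amp(G’) on a small hypothetical graph. The helper names `pagerank` and `amplification`, the six-node example graph, and the assumption that every node has at least one out-link are choices made here, not details from the slides.

```python
def pagerank(out_links, eps=0.15, iters=200):
    """Plain PageRank by power iteration (assumes no dangling nodes)."""
    n = len(out_links)
    pr = [1.0 / n] * n
    for _ in range(iters):
        new = [eps / n] * n                      # reset term
        for u, succs in enumerate(out_links):
            for v in succs:                      # link-following term
                new[v] += (1 - eps) * pr[u] / len(succs)
        pr = new
    return pr

def amplification(out_links, pr, group, eps=0.15):
    """Amp(G') = W_G(G') / W_in(G') for the node set `group`."""
    group = set(group)
    n = len(out_links)
    # Real group weight: total PR weight held by the group.
    w_g = sum(pr[i] for i in group)
    # Weight flowing in over links (i, j) with i outside and j inside.
    inflow = sum(pr[i] / len(succs)
                 for i, succs in enumerate(out_links) if i not in group
                 for j in succs if j in group)
    # "Actual" group weight: external link flow plus resets from outside.
    w_in = (1 - eps) * inflow + eps * len(group) / n * (1 - w_g)
    return w_g / w_in

# Hypothetical graph: nodes 0 and 1 link only to each other (a 2-node ring).
graph = [[1], [0], [0, 3], [1, 4], [5, 2], [2, 3]]
pr = pagerank(graph)
print("colluding pair {0, 1}:", amplification(graph, pr, {0, 1}))
print("ordinary pair  {2, 3}:", amplification(graph, pr, {2, 3}))
```

In this toy graph the pair that keeps all of its out-links inside the group gets a much larger amplification factor than a pair that still links outward.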

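Answer for (1+1 = ?) in PageRank

• In the original PageRank system,

$$\forall\, G' \subseteq G:\quad \mathrm{Amp}(G') \le \frac{2}{\epsilon},$$

where ε is the resetting probability.

Specifically, when |G’|/|G| ≪ 1,

$$\mathrm{Amp}(G') \le \frac{1}{\epsilon}.$$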
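A short derivation sketch, not taken from the slides, shows where a bound of order 1/ε can come from. It assumes the PageRank update with resetting probability ε, no dangling nodes, and the definitions of W_G and W_in used for Amp; E and I are shorthand introduced here for the link flow into G’ from outside and from inside the group.

```latex
\begin{align*}
W_G(G') &= (1-\epsilon)\,(E + I) + \epsilon\,\frac{|G'|}{|G|},
\qquad
E = \sum_{(i,j):\, i \notin G',\, j \in G'} \frac{PR(i)}{out(i)},
\qquad
I = \sum_{(i,j):\, i \in G',\, j \in G'} \frac{PR(i)}{out(i)} \le W_G(G') \\[4pt]
\epsilon\, W_G(G') &\le (1-\epsilon)\,E + \epsilon\,\frac{|G'|}{|G|}
 = W_{in}(G') + \epsilon\,\frac{|G'|}{|G|}\, W_G(G') \\[4pt]
\mathrm{Amp}(G') &= \frac{W_G(G')}{W_{in}(G')}
 \le \frac{1}{\epsilon\,\bigl(1 - |G'|/|G|\bigr)}
\end{align*}
```

Under these assumptions the right-hand side is at most 2/ε whenever |G’| ≤ |G|/2 and approaches 1/ε as |G’|/|G| goes to 0. Equality holds when no link leaves G’, which is the situation a colluding group aims for.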

Two experimental topologies

• W, a Web link topology

Contains the link structure of upwards of 80 million URLs.

Source: the Stanford WebBase.

• B, a weblog blogrolling topology

Contains the blogrolling structure of upwards of 72,000 blogs.

Source: www.blogstreet.com, the XML-RPC weblog service.

Experiment 1: Collusion200

• Model a small number of web pages simultaneously colluding.

• Methodology:

•200 colluding nodes, forming 100 colluding groups;

•Each colluding group has the circle topology consisting of two nodes with adjacent ranks;

•Arbitrarily chose node pairs originally ranked around 1000th, 2000th, …, 100000th.

•ε = 0.15.
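The setup of this experiment can be sketched in a few lines. This is an illustration under stated assumptions rather than the script used in the experiment: the function names are hypothetical, "adjacent ranks" is read as the nodes ranked t and t+1, and each colluding node is assumed to drop all of its original out-links and keep only the link to its partner.

```python
def pairs_around_ranks(pr, targets):
    """For each 1-based target rank t, pick the nodes ranked t and t+1."""
    order = sorted(range(len(pr)), key=lambda v: -pr[v])  # best rank first
    return [(order[t - 1], order[t]) for t in targets]

def collude_pairs(out_links, pairs):
    """Return a copy of the graph in which each pair forms a 2-node ring.

    Assumption: colluding nodes discard every other out-link.
    """
    new_links = [list(succs) for succs in out_links]
    for a, b in pairs:
        new_links[a] = [b]
        new_links[b] = [a]
    return new_links

# Tiny hypothetical example: four nodes with made-up PR weights.
weights = [0.1, 0.4, 0.2, 0.3]
graph = [[1], [2], [3], [0]]
pairs = pairs_around_ranks(weights, [1])   # the two top-ranked nodes
print(pairs, collude_pairs(graph, pairs))
```

For the experiment described here, `targets` would be 1000, 2000, and so on up to 100000, and PageRank would be recomputed on the rewired graph to obtain the new ranks.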

Experiment result of Collusion200 (I)

Figure 1: W - Amplification factors of the 100 colluding groups in Collusion200.

Experiment result of Collusion200 (III)

Figure 2: W – new PR rank after Collusion200.

Old rank: 1005th; new rank: 67th

Old rank: 10001st; new rank: 450th

Old rank: 100009th; new rank: 5038th

There is a long flat portion…

Figure 3: The PR weight distribution of 4 topologies.

Next step: how to detect collusions?

• Identifying colluding groups is unlikely to be computationally tractable.

•The densest k-subgraph problem [Feige et al. 1997].

•The classical CLIQUE problem.

•The problem of finding large hidden cliques in random graphs [Juels 1998].

Hardness on Amp

• Theorem on Hardness.

Computing $\max_{G' \subseteq G} \mathrm{Amp}(G')$ is an NP-hard problem.

How about using finer statistics of the random walk?

• The revisit intervals of the random walk on a colluding node will likely have a large variance compared to their expectation.

A counterexample: a star+dangling circle topology

[Figure E: a star+dangling circle topology with nodes labeled 0, 1, 2, N-2, N-1, N, and N+1.]

An observation on collusion behaviors

• To increase their PR weight, i.e., the stationary weight in the random walk, the colluding nodes will stall the random walk.

• When the resetting probability increases, the colluding nodes must suffer a significant drop in PR weight.

• Therefore, we expect the PR weight of colluding nodes to be highly correlated with 1/ε (the average walk length), while that of non-colluding nodes is relatively insensitive to the change in ε.

[Figure: a colluding group G’ inside the system G.]
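The link between stalling the walk and 1/ε can be made concrete with a small calculation, offered as an illustration rather than as part of the talk. Assume a colluding group with no link leaving it, so a walker inside can only leave through a reset; T (a symbol introduced here) is the number of steps up to and including the first reset.

```latex
\begin{align*}
\Pr[T = t] &= \epsilon\,(1-\epsilon)^{t-1}, \qquad t = 1, 2, \dots \\
\mathbb{E}[T] &= \sum_{t \ge 1} t\,\epsilon\,(1-\epsilon)^{t-1} = \frac{1}{\epsilon}
\end{align*}
```

With ε = 0.15 the walk lasts about 6.7 steps on average between resets, and with ε = 0.3 only about 3.3, which suggests why a group that depends on trapping the walker loses weight as ε grows.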

An intuitive example

[Figure: nodes joined by referential links, with a colluding group highlighted; x is a node in the colluding group and y a node outside it.]

• A colluding node x: PR(x) = [expression not recoverable from this transcript], and co-co(PR(x), 1/ε) ≈ 1. (co-co: correlation coefficient)

• A non-colluding node y: PR(y) = [expression not recoverable from this transcript], and co-co(PR(y), 1/ε) ≈ 0.

N: the system size; K: the colluding group size; K << N.

Adaptive-resetting scheme

• Part I – collusion detection:

Given the topology, calculate the PR vector under different ε values.

{ε} = {0.0375, 0.05, 0.075, 0.15, 0.3, 0.45, 0.6}, ε_default = 0.15.

Calculate the correlation coefficient between the curve of each node x's PR weight and the curve of 1/ε. Label it as co-co(x).

• Part II – personalization:

Calculate each node x's out-link personalized-ε = F(ε_default, co-co(x)).

Exponential function: $F_{Exp} = \epsilon_{default}^{\,(1.0 - \text{co-co}(x))}$.

Linear function: $F_{Linear} = \epsilon_{default} + (0.5 - \epsilon_{default}) \cdot \text{co-co}(x)$.

The final PR weight vector is calculated with these personalized resetting values.
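The two parts of the scheme can be sketched end to end as follows. This is a minimal reading of the slide, not the author's code: the function names, the toy graph, the use of the Pearson coefficient from `np.corrcoef` for co-co, the clipping of co-co to [0, 1], and the requirement that every node has an out-link are all assumptions made here.

```python
import numpy as np

EPS_VALUES = [0.0375, 0.05, 0.075, 0.15, 0.3, 0.45, 0.6]
EPS_DEFAULT = 0.15

def pagerank(out_links, eps, iters=500):
    """PageRank where node u resets with probability eps[u] on leaving u.

    eps may be a scalar or one value per node (assumes no dangling nodes).
    """
    n = len(out_links)
    eps = np.broadcast_to(np.asarray(eps, dtype=float), (n,))
    pr = np.full(n, 1.0 / n)
    for _ in range(iters):
        # All reset mass is spread uniformly over the n nodes.
        new = np.full(n, float(np.dot(eps, pr)) / n)
        for u, succs in enumerate(out_links):
            share = (1.0 - eps[u]) * pr[u] / len(succs)
            for v in succs:
                new[v] += share
        pr = new
    return pr

def co_co(out_links):
    """Part I: correlation of each node's PR curve with the 1/eps curve."""
    inv_eps = 1.0 / np.array(EPS_VALUES)
    curves = np.array([pagerank(out_links, e) for e in EPS_VALUES])
    c = [np.corrcoef(curves[:, x], inv_eps)[0, 1]
         for x in range(curves.shape[1])]
    # Assumption: a perfectly flat PR curve counts as uncorrelated.
    return np.nan_to_num(np.array(c))

def adaptive_pagerank(out_links, f="exp"):
    """Part II: personalized resetting values, then the final PR vector."""
    # Assumption: negative correlations are clipped to 0 so that the
    # personalized value stays a valid probability.
    c = np.clip(co_co(out_links), 0.0, 1.0)
    if f == "exp":
        eps = EPS_DEFAULT ** (1.0 - c)
    else:
        eps = EPS_DEFAULT + (0.5 - EPS_DEFAULT) * c
    return pagerank(out_links, eps)

# Hypothetical graph: nodes 0 and 1 collude in a 2-node ring.
graph = [[1], [0], [0, 3], [1, 4], [5, 2], [2, 3]]
print("co-co:   ", np.round(co_co(graph), 3))
print("plain PR:", np.round(pagerank(graph, EPS_DEFAULT), 3))
print("adaptive:", np.round(adaptive_pagerank(graph), 3))
```

A node with co-co near 1 ends up with a resetting value close to 1 under the exponential function, or close to 0.5 under the linear one, so the walker leaves it quickly. A node with co-co near 0 keeps the default value.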

Experiment result of Collusion200 (IV)

Figure 5: W - Amplification factors of the 100 colluding groups in Collusion200.

Experiment result of Collusion200 (VI)

Figure 6: W – new PR rank after Collusion200.

Experiment 2: Collusion22

• Model various colluding subgraphs.

• Methodology:

3 colluding groups:

G1: 10-node ring

G2: 10-node star topology

G3: 2-node ring

[Figure: the three colluding groups drawn as nodes joined by referential links.]

Experiment result of Collusion22 (I)

Figure 7: Amplification factors of the 3 colluding groups in Collusion22.

Experiment result of Collusion22 (II)

Figure 8: W – new PR weight after Collusion22.

New top-25 URL list in W

[Figure legend: Dropped out, Dropping, New.]

Conclusions

• Simple collusions lead to effective Web ranking improvement.

• A simple scheme based on the PageRank algorithm effectively counteracts Web ranking collusions.

Backup slides

Reputation systems [Okita2003]

• A means of describing social trust networks.

• The basic concept is a democratic meritocracy.

• A rating system is used to evaluate individual members, and those results are then collated to produce a consensus about the merit of any given member.

• Examples: Livejournal, Friendster, eBay, Advogato

PageRank algorithm [Brin1998]

• Assume N pages.

• Assign all pages the initial value 1/N.

• Let N_u be the out-degree of Page u, Rank(v) the importance of Page v, B_v the set of pages pointing to v.

• Basic algorithm:

$$\forall v:\quad Rank(v) = \sum_{u \in B_v} \frac{Rank(u)}{N_u}$$

• Enhanced algorithm against rank sinks:

$$\forall v:\quad Rank(v) = (1-\epsilon) \sum_{u \in B_v} \frac{Rank(u)}{N_u} + \frac{\epsilon}{N}$$

ε: damping factor

Figure 4: the co-co PDF distribution in W and B: the [0, 0.1] range actually corresponds to [-1, 0.1] range.

Co-co distribution in real-world graphs

Figure A: W – new PR weight after Collusion200.

Experiment result of Collusion200 (II)

Figure B: B – new PR rank after Collusion200

Experiment result of Collusion200 (VII)

Figure C: B – new PR weight after Collusion200

Experiment result of Collusion200 (X)

Figure 6: W – new PR weight after Collusion200.

Experiment result of Collusion200 (V)

Correlation coefficient

Figure D: W – new PR rank after Collusion22.

Experiment result of Collusion22 (III)