Using Topology to Identify Spam (SIGIR 2007)


Web Spam Detection

C. Castillo, D. Donato, A. Gionis, V. Murdock, F. Silvestri

Spam on the Web

Detecting Web Spam

Link-Based Detection

Content-Based Detection

Using Links and Contents

Using the Web Topology

Conclusions

Know your Neighbors: Web Spam Detection Using the Web Topology

Carlos Castillo (1), Debora Donato (1), Aristides Gionis (1), Vanessa Murdock (1), Fabrizio Silvestri (2)

1. Yahoo! Research Barcelona – Catalunya, Spain
2. ISTI-CNR – Pisa, Italy

ACM SIGIR, 25 July 2007, Amsterdam


1 Spam on the Web

2 Detecting Web Spam

3 Link-Based Detection

4 Content-Based Detection

5 Using Links and Contents

6 Using the Web Topology

7 Conclusions


What is on the Web?


What is on the Web [2.0]?


What else is on the Web?

Source: www.milliondollarhomepage.com


What’s happening on the Web?

There is a fierce competition for your attention


What’s happening on the Web?

Search engines are to some extent arbiters of this competition, and they must watch it closely, otherwise ...


Some cheating occurs

1986 FIFA World Cup, Argentina vs England


Simple web spam


Hidden text


Made for advertising


Search engine?


Fake search engine


“Normal” content in link farms


There are many attempts at cheating on the Web

Most of these are spam:

1,630,000 results for “free mp3 hilton viagra” in SE1

1,760,000 results for “credit vicodin loan” in SE2

1,320,000 results for “porn mortgage” in SE3


Costs

✗ Costs for users: lower precision for some queries

✗ Costs for search engines: wasted storage space, network resources, and processing cycles

✗ Costs for the publishers: resources invested in cheating and not in improving their contents


Cheating on the Web

→ Link spam

→ Content spam

Spam-oriented blogging

Comment/forum/Wiki spam

Malicious cloaking

Click fraud ×2

Malicious tagging

. . . more?

Adversarial relationship

Every undeserved gain in ranking for a spammer is a loss of precision for the search engine.


Research on Web spam detection

Web spam detection techniques


Spam, damn spam and statistics

[Fetterly et al., 2004] propose to study statistical distributions: “in a number of these distributions, outlier values are associated with web spam”


Machine learning training


Machine learning


Challenges

Scalability

+ Machine Learning Challenges:

Instances are not really independent (graph)

Training set is relatively small

+ Information Retrieval Challenges:

It is hard to find out which features are relevant

Features can be aggregated in content units: page/host/domain

Features can be propagated through the graph


Training data

✗ It is hard for search engines to provide labeled data

✗ Even if they do, it will not reflect a consensus on what is Web Spam

✓ Public Web Spam collection built by a group of volunteers: http://www.yr-bcn.es/webspam/


“Link farms”

[Figure: diagram with the labels Web, Link farm and Spam page.]

Single-level farms can be detected by searching for groups of nodes sharing their out-links [Gibson et al., 2005]
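
To make the idea of "groups of nodes sharing their out-links" concrete, here is a deliberately naive sketch. It is not the method of the cited work: it only groups nodes whose out-link sets are exactly identical, which is the simplest possible stand-in. The function name `groups_sharing_out_links`, the `min_size` parameter, and the toy graph are all hypothetical.

```python
from collections import defaultdict


def groups_sharing_out_links(out_links, min_size=2):
    """Group nodes whose sets of out-links are exactly identical.

    out_links maps each node to an iterable of the nodes it links to.
    Returns a list of sorted groups with at least min_size members.
    """
    by_targets = defaultdict(list)
    for node, targets in out_links.items():
        targets = frozenset(targets)
        if targets:  # nodes without out-links carry no signal here
            by_targets[targets].append(node)
    return [sorted(nodes) for nodes in by_targets.values() if len(nodes) >= min_size]


# Toy graph: a, b and c all link only to s, which looks like a single-level farm.
toy = {"a": ["s"], "b": ["s"], "c": ["s"], "d": ["x", "y"], "s": []}
farms = groups_sharing_out_links(toy, min_size=3)
```

A real system would need approximate matching, since farm pages rarely share every single out-link.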


Handling large graphs

Memory size enough to hold some data per node

Disk size enough to hold some data per edge

A small number of passes over the data


Semi-streaming model

for node : 1 . . . N do
    INITIALIZE-MEM(node)
end for
for distance : 1 . . . d do {Iteration step}
    for src : 1 . . . N do {Follow links in the graph}
        for all links from src to dest do
            COMPUTE(src, dest)
        end for
    end for
    NORMALIZE
end for
POST-PROCESS
return Something
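
As one possible instance of this template, the sketch below runs PageRank with per-node state in memory and the edges read sequentially once per pass. The function name, the `edge_stream` callable (standing in for a sequential read from disk), the damping value 0.85, and the toy graph are assumptions for illustration, not taken from the slides.

```python
def semi_streaming_pagerank(num_nodes, edge_stream, out_degree, passes=20, alpha=0.85):
    """PageRank with O(N) memory and one sequential edge scan per pass.

    edge_stream is a callable returning a fresh iterator of (src, dest) pairs,
    standing in for a sequential read of the edge file from disk.
    """
    # INITIALIZE-MEM: one float per node
    score = [1.0 / num_nodes] * num_nodes
    for _ in range(passes):  # iteration step
        nxt = [0.0] * num_nodes
        for src, dest in edge_stream():  # follow links in the graph
            # COMPUTE: src passes a damped share of its score to dest
            nxt[dest] += alpha * score[src] / out_degree[src]
        # NORMALIZE: spread the mass lost to damping and dangling nodes uniformly
        missing = 1.0 - sum(nxt)
        score = [value + missing / num_nodes for value in nxt]
    return score


# Toy graph: a 3-cycle 0 -> 1 -> 2 -> 0, plus node 3 linking to node 0.
edges = [(0, 1), (1, 2), (2, 0), (3, 0)]
scores = semi_streaming_pagerank(4, lambda: iter(edges), out_degree=[1, 1, 1, 1])
```

Only the two score arrays live in memory; the edge list is never held as a random-access structure, which is what keeps the footprint at "some data per node".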


Link-based features

Degree-related measures

PageRank

TrustRank [Gyongyi et al., 2004]

Truncated PageRank [Becchetti et al., 2006]

Estimation of supporters [Becchetti et al., 2006]

140 features per host (2 pages per host)


Degree-Based

[Figure: two histograms of degree-based measures, Normal vs. Spam.]


TrustRank

[Gyongyi et al., 2004]


TrustRank / PageRank

[Figure: histogram of TrustRank / PageRank, Normal vs. Spam.]


Truncated PageRank

Proposed in [Becchetti et al., 2006]. Idea: reduce the direct contribution of the first levels of links:

$$\mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^{t} & t > T \end{cases}$$

✓ No extra reading of the graph after PageRank
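
The damping function can be read as a weight on paths of length t ending at a page. The sketch below accumulates those weighted contributions iteratively and only starts summing once t exceeds T. The function name, the adjacency-list input, the cutoff `max_t`, and the choice of C (picked here so that the damping weights sum to 1) are assumptions for illustration.

```python
def truncated_pagerank(out_links, T=2, alpha=0.85, max_t=50):
    """Sum the contributions of paths longer than T, weighted by C * alpha**t.

    out_links[i] is the list of nodes that node i links to.
    """
    n = len(out_links)
    # Assumption: C normalizes the weights so that sum over t > T of C*alpha**t is 1.
    C = (1 - alpha) / alpha ** (T + 1)
    contrib = [C / n] * n  # contribution of paths of length 0
    total = [0.0] * n
    for t in range(1, max_t + 1):
        nxt = [0.0] * n
        for src, targets in enumerate(out_links):
            for dest in targets:
                nxt[dest] += alpha * contrib[src] / len(targets)
        contrib = nxt  # now holds the contribution of paths of length t
        if t > T:  # damping(t) = 0 for t <= T, so early levels are skipped
            total = [a + b for a, b in zip(total, contrib)]
    return total


# Chain 0 -> 1 -> 2: node 2 is only supported by paths of length 1 and 2.
chain = [[1], [2], []]
with_first_levels = truncated_pagerank(chain, T=0)
without_first_levels = truncated_pagerank(chain, T=2)
```

In the chain example, node 2 keeps a positive score for T = 0 but drops to zero for T = 2, since all of its support comes from the first two levels of links.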


Hop-plot


High and low-ranked pages are different

[Figure: number of nodes (×10^4) as a function of distance (1 to 20), with curves for Top 0%–10%, Top 40%–50% and Top 60%–70%.]

Areas below the curves are equal if we are in the same strongly-connected component


Probabilistic counting

[Figure: bit vectors (e.g. 100010, 110000, 000110) are propagated towards a target page using the “OR” operation; the bits set at the target page are counted to estimate its supporters.]

[Becchetti et al., 2006] shows an improvement of the ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
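
The propagation described above can be sketched as follows. This is a simplified illustration, not the exact estimator of the cited papers: every node draws a random bit vector, vectors are OR-ed along in-links for a fixed number of rounds, and the count of set bits is inverted into an estimate. The function name, the parameters `bits` and `eps`, and the inversion formula (which assumes each bit is set independently with probability `eps` per supporter) are assumptions.

```python
import math
import random


def estimate_supporters(in_links, distance, bits=64, eps=0.05, seed=0):
    """Estimate, for every node, how many nodes reach it within `distance` hops.

    in_links[i] lists the nodes that link to node i. Each node counts itself.
    """
    rng = random.Random(seed)
    n = len(in_links)
    # Each node starts with a bit vector (stored as an int) with sparse random bits.
    vec = [sum(1 << b for b in range(bits) if rng.random() < eps) for _ in range(n)]
    for _ in range(distance):
        nxt = list(vec)
        for node, sources in enumerate(in_links):
            for src in sources:
                nxt[node] |= vec[src]  # propagation of bits using the OR operation
        vec = nxt
    estimates = []
    for v in vec:
        ones = bin(v).count("1")  # count bits set to estimate supporters
        if ones == bits:
            estimates.append(float("inf"))  # saturated vector: no finite estimate
        else:
            # With m supporters, a bit is set with probability 1 - (1 - eps)**m.
            estimates.append(math.log(1 - ones / bits) / math.log(1 - eps))
    return estimates


# Chain 0 -> 1 -> 2 -> 3: node 3 has the most supporters within 3 hops.
chain_in = [[], [0], [1], [2]]
est = estimate_supporters(chain_in, distance=3)
```

Because OR can only add bits, a node's estimate never decreases as the distance grows, and the state per node stays at a fixed number of bits regardless of graph size.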


Bottleneck number

$$b_d(x) = \min_{j \le d} \left\{ \frac{|N_j(x)|}{|N_{j-1}(x)|} \right\}$$

Minimum rate of growth of the neighbors of x up to a certain distance.
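
A small exact computation may help to read the formula. The sketch assumes that N_j(x) is the set of nodes that can reach x in at most j hops, with N_0(x) = {x}; the slides do not spell this out, so treat it as one plausible reading. The function name and the toy graph are hypothetical.

```python
def bottleneck_number(in_links, x, d):
    """Return min over j <= d of |N_j(x)| / |N_{j-1}(x)|.

    Assumption: N_j(x) is the set of nodes reaching x in at most j hops,
    with N_0(x) = {x}. in_links maps a node to the nodes linking to it.
    """
    reached = {x}  # N_0(x)
    frontier = {x}
    prev_size = 1
    best = float("inf")
    for _ in range(d):
        # Nodes one hop further back along in-links that were not seen before
        frontier = {u for v in frontier for u in in_links.get(v, ())} - reached
        reached |= frontier
        best = min(best, len(reached) / prev_size)
        prev_size = len(reached)
    return best


# Star: four pages link to x, and nothing links to those four pages.
star = {"x": ["a", "b", "c", "d"]}
b1 = bottleneck_number(star, "x", 1)
b2 = bottleneck_number(star, "x", 2)
```

In the star example the neighborhood grows fivefold at the first level and then stops growing, so the bottleneck number falls to 1 at distance 2.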


Bottleneck number: spam


Bottleneck number: normal


Bottleneck number

$$b_d(x) = \min_{j \le d} \left\{ \frac{|N_j(x)|}{|N_{j-1}(x)|} \right\}$$

[Figure: histogram of the bottleneck number, Normal vs. Spam.]


Content-Based Features

Most of the features reported in [Ntoulas et al., 2006]

Number of words in the page and title

Average word length

Fraction of anchor text

Fraction of visible text

Compression rate

Corpus precision and corpus recall

Query precision and query recall

Independent trigram likelihood

Entropy of trigrams

96 features per host


Content-based features (entropy related)

$T = \{(w_1, p_1), \ldots, (w_k, p_k)\}$ is the set of trigrams in a page, where trigram $w_i$ has frequency $p_i$

Features:

Entropy of trigrams: $H = -\sum_{w_i \in T} p_i \log p_i$

Also, compression rate, as measured by bzip
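
Both features are easy to prototype. The sketch below uses word trigrams, a base-2 logarithm, and Python's standard `bz2` module, and defines the compression rate as uncompressed size divided by compressed size; the function names and those three choices are assumptions, since the slides do not fix them.

```python
import bz2
import math
from collections import Counter


def trigram_entropy(words):
    """Entropy H = -sum(p_i * log2(p_i)) over the word trigrams of a page."""
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    total = len(trigrams)
    counts = Counter(trigrams)
    # p_i is the relative frequency of trigram w_i in the page
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def compression_rate(text):
    """Uncompressed size divided by bzip2-compressed size (assumed definition)."""
    raw = text.encode("utf-8")
    return len(raw) / len(bz2.compress(raw))


# A highly repetitive page: few distinct trigrams and very compressible text.
repetitive = "buy cheap pills " * 200
entropy = trigram_entropy(repetitive.split())
rate = compression_rate(repetitive)
```

The repetitive page has only three distinct trigrams, so its entropy stays near log2(3) however long the page gets, while its compression rate is high.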


Content-based features (related to popular keywords)

F set of most frequent terms in the collection

Q set of most frequent terms in a query log

P set of terms in a page

Features:

Corpus “precision”: |P ∩ F| / |P|

Corpus “recall”: |P ∩ F| / |F|

Query “precision”: |P ∩ Q| / |P|

Query “recall”: |P ∩ Q| / |Q|
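
These four features are plain set overlaps, as the following sketch shows. The function name and the toy term lists are hypothetical; the same function covers the corpus variants (pass F) and the query variants (pass Q).

```python
def overlap_features(page_terms, frequent_terms):
    """Return ("precision", "recall") of a page against a set of popular terms.

    precision = |P ∩ F| / |P| and recall = |P ∩ F| / |F|, where P is the set
    of terms in the page and F the set of frequent terms (or Q for queries).
    """
    P, F = set(page_terms), set(frequent_terms)
    common = len(P & F)
    precision = common / len(P) if P else 0.0
    recall = common / len(F) if F else 0.0
    return precision, recall


# Toy example: the page uses 3 distinct terms, 2 of which are popular.
page = ["free", "mp3", "download", "free"]
popular = ["free", "the", "of", "mp3"]
precision, recall = overlap_features(page, popular)
```

Here the page has 3 distinct terms and 2 of them are popular, giving a precision of 2/3 and a recall of 2/4.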


Average word length

Figure: Histogram of the average word length in non-spam vs. spam pages for k = 500.


Corpus precision

Figure: Histogram of the corpus precision in non-spam vs. spam pages.


Query precision

Figure: Histogram of the query precision in non-spam vs. spam pages for k = 500.


Cost-sensitive decision tree with bagging

Bagging of 10 decision trees, asymmetrical costs.

Cost ratio             1       10      20      30      50
True positive rate     65.8%   66.7%   71.1%   78.7%   84.1%
False positive rate    2.8%    3.4%    4.5%    5.7%    8.6%
F-Measure              0.712   0.703   0.704   0.723   0.692


Link- and content-based features

Link-based and content-based

                       Both     Link-only   Content-only
True positive rate     78.7%    79.4%       64.9%
False positive rate    5.7%     9.0%        3.7%
F-Measure              0.723    0.659       0.683


Hypothesis

Pages topologically close to each other are more likely to have the same label (spam/nonspam) than random pairs of pages.

Pages linked together are more likely to be on the same topic than random pairs of pages [Davison, 2000]

Topological dependencies: in-links

Histogram of the fraction of spam hosts in the in-links

0 = no in-link comes from spam hosts

1 = all of the in-links come from spam hosts

[Figure: histogram; x-axis: fraction of spam hosts in the in-links, 0.0 to 1.0; y-axis: 0 to 0.4; series: in-links of non-spam, in-links of spam]
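
The per-host quantity behind such a histogram can be sketched as follows. This is a minimal illustration, assuming the host graph is given as a list of (source, destination) pairs and that every host has a known label. The hosts, labels, and function name are hypothetical.

```python
def spam_fraction_of_in_links(edges, is_spam):
    """Map each host that has in-links to the fraction of its
    in-linking hosts that are labeled spam."""
    in_neighbors = {}
    for src, dst in edges:
        in_neighbors.setdefault(dst, set()).add(src)
    return {
        host: sum(is_spam[s] for s in sources) / len(sources)
        for host, sources in in_neighbors.items()
    }


# Toy host graph (hypothetical hosts and labels).
edges = [("a", "c"), ("b", "c"), ("s1", "s2"),
         ("s2", "s1"), ("s1", "c"), ("a", "b")]
is_spam = {"a": False, "b": False, "c": False, "s1": True, "s2": True}

fractions = spam_fraction_of_in_links(edges, is_spam)
print(fractions)
```

Swapping the source and destination in each pair gives the same quantity over out-links. Plotting the values separately for spam and non-spam hosts yields the two series of the histogram.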

Topological dependencies: out-links

Histogram of the fraction of spam hosts in the out-links

0 = none of the out-links points to spam hosts

1 = all of the out-links point to spam hosts

[Figure: histogram; x-axis: fraction of spam hosts in the out-links, 0.0 to 1.0; y-axis: 0 to 1; series: out-links of non-spam, out-links of spam]

Idea 1: Clustering

Classify, then cluster hosts, then assign the same label to all hosts in the same cluster by majority voting
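
The voting step can be sketched as below. This is a minimal illustration, assuming the cluster assignment is already available (the slide does not say which clustering algorithm is used) and resolving ties as non-spam, which is an arbitrary choice made here. All names and data are hypothetical.

```python
from collections import defaultdict


def smooth_by_cluster(predicted_spam, cluster_of):
    """Relabel every host with the majority label of its cluster."""
    votes = defaultdict(list)
    for host, label in predicted_spam.items():
        votes[cluster_of[host]].append(label)
    # A cluster is spam only on a strict majority of spam votes
    # (ties fall to non-spam; this tie rule is an assumption).
    cluster_label = {c: 2 * sum(v) > len(v) for c, v in votes.items()}
    return {host: cluster_label[cluster_of[host]] for host in predicted_spam}


# Hypothetical initial predictions (True = spam) and cluster ids.
predicted = {"h1": True, "h2": True, "h3": False,
             "h4": False, "h5": False, "h6": True}
cluster_of = {"h1": 0, "h2": 0, "h3": 0, "h4": 1, "h5": 1, "h6": 1}

final = smooth_by_cluster(predicted, cluster_of)
print(final)
```

In this toy run h3 is flipped to spam and h6 to non-spam, because each disagrees with the majority of its own cluster.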

Idea 1: Clustering (cont.)

[Figures: initial prediction; clustering; final prediction]

Idea 1: Clustering – Results

                      Baseline   Clustering
Without bagging
True positive rate    75.6%      74.5%
False positive rate   8.5%       6.8%
F-Measure             0.646      0.673
With bagging
True positive rate    78.7%      76.9%
False positive rate   5.7%       5.0%
F-Measure             0.723      0.728

✓ Reduces error rate

Idea 2: Propagate the label

Classify, then interpret “spamicity” as a probability, then do a random walk with restart from those nodes
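
A random walk with restart can be approximated by power iteration. This is a minimal sketch, assuming the restart distribution is the normalized classifier spamicity, a damping factor of 0.85, and that the mass of hosts without out-links is returned to the restart distribution. None of these choices is stated on the slide. The graph, scores, threshold, and function name are hypothetical.

```python
def random_walk_with_restart(out_links, restart, damping=0.85, iterations=50):
    """Power iteration for a walk that follows a random out-link with
    probability `damping` and otherwise jumps to a node drawn from
    the normalized `restart` weights."""
    total = sum(restart.values())
    v = {n: w / total for n, w in restart.items()}  # restart distribution
    score = dict(v)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * v[n] for n in v}
        for node, mass in score.items():
            targets = out_links.get(node, [])
            if targets:
                share = damping * mass / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: return its mass to the restart distribution.
                for n in v:
                    nxt[n] += damping * mass * v[n]
        score = nxt
    return score


# Hypothetical host graph and classifier outputs.
out_links = {"a": ["b"], "b": ["a"], "s1": ["s2"], "s2": ["s1", "c"], "c": []}
spamicity = {"a": 0.1, "b": 0.1, "c": 0.2, "s1": 0.9, "s2": 0.8}

scores = random_walk_with_restart(out_links, spamicity)
threshold = 0.2  # arbitrary illustrative threshold
flagged = {n for n, s in scores.items() if s > threshold}
print(scores, flagged)
```

Running the same function on the reversed link graph propagates the scores in the opposite direction, and the two score vectors can be combined before thresholding.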

Idea 2: Propagate the label (cont.)

[Figures: initial prediction; propagation; final prediction, applying a threshold]

Idea 2: Propagate the label – Results

                      Baseline   Fwds.   Backwds.   Both
Classifier without bagging
True positive rate    75.6%      70.9%   69.4%      71.4%
False positive rate   8.5%       6.1%    5.8%       5.8%
F-Measure             0.646      0.665   0.664      0.676
Classifier with bagging
True positive rate    78.7%      76.5%   75.0%      75.2%
False positive rate   5.7%       5.4%    4.3%       4.7%
F-Measure             0.723      0.716   0.733      0.724

Idea 3: Stacked graphical learning

Meta-learning scheme [Cohen and Kou, 2006]

Derive initial predictions

Generate an additional attribute for each object by combining predictions on neighbors in the graph

Append the additional attribute to the data and retrain

Idea 3: Stacked graphical learning (cont.)

Let p(x) ∈ [0..1] be the prediction of a classification algorithm for a host x using k features

Let N(x) be the set of pages related to x (in some way)

Compute

$$ f(x) = \frac{\sum_{g \in N(x)} p(g)}{|N(x)|} $$

Add f(x) as an extra feature for instance x and learn a new model with k + 1 features
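
The extra feature f(x) can be computed as in the sketch below. It assumes predictions and neighbor sets are stored in dictionaries, and that a host with no neighbors falls back to its own prediction. That fallback is an assumption made here, since f(x) is undefined when N(x) is empty. All names and values are hypothetical.

```python
def neighbor_average(predictions, neighbors):
    """f(x): mean of p(g) over the hosts g in N(x)."""
    f = {}
    for x, related in neighbors.items():
        if related:
            f[x] = sum(predictions[g] for g in related) / len(related)
        else:
            f[x] = predictions[x]  # assumption: no neighbors, reuse p(x)
    return f


# Hypothetical first-pass predictions p(x) and neighbor sets N(x).
predictions = {"h1": 0.9, "h2": 0.8, "h3": 0.1}
neighbors = {"h1": ["h2"], "h2": ["h1", "h3"], "h3": []}
features = {"h1": [0.3, 1.2], "h2": [0.5, 0.7], "h3": [0.1, 0.0]}  # k = 2

f = neighbor_average(predictions, neighbors)
# Append f(x) to each feature vector, giving k + 1 features per host.
extended = {x: feats + [f[x]] for x, feats in features.items()}
print(extended)
```

The classifier is then retrained on the extended vectors. Choosing N(x) as in-neighbors, out-neighbors, or both gives different variants of the feature.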

Idea 3: Stacked graphical learning (cont.)

[Figures: initial prediction; computation of new feature; new prediction with k + 1 features]

Idea 3: Stacked graphical learning - Results

                      Baseline   Avg. of in   Avg. of out   Avg. of both
True positive rate    78.7%      84.4%        78.3%         85.2%
False positive rate   5.7%       6.7%         4.8%          6.1%
F-Measure             0.723      0.733        0.742         0.750

✓ Increases detection rate

Idea 3: Stacked graphical learning x2

And repeat ...

                      Baseline   First pass   Second pass
True positive rate    78.7%      85.2%        88.4%
False positive rate   5.7%       6.1%         6.3%
F-Measure             0.723      0.750        0.763

✓ Significant improvement over the baseline

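Concluding remarks

✓ Considering both content-based and link-based attributes improves the accuracy of the classifier (F-Measure 0.723, versus 0.659 link-only and 0.683 content-only)

✓ Considering the links among pages improves the accuracy further (F-Measure 0.763 after two passes of stacked graphical learning)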

Web Spam Dataset: http://www.yr-bcn.es/webspam/

Web Spam Challenge I & II: http://webspam.lip6.fr/

AIRWeb Workshop: http://airweb.cse.lehigh.edu/

GraphLab at ECML/PKDD: http://graphlab.lip6.fr/

Newsletter: webspam-announces@yahoogroups.com

Thank you!

Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD), Pennsylvania, USA. ACM Press.

Cohen, W. W. and Kou, Z. (2006). Stacked graphical learning: approximating learning in Markov random fields using very short inhomogeneous Markov chains. Technical report.

Davison, B. D. (2000). Topical locality in the Web. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 272–279, Athens, Greece. ACM Press.

Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the Seventh Workshop on the Web and Databases (WebDB), pages 1–6, Paris, France.

Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209.

Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB ’05: Proceedings of the 31st International Conference on Very Large Data Bases, pages 721–732. VLDB Endowment.

Gyongyi, Z., Garcia-Molina, H., and Pedersen, J. (2004). Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada. Morgan Kaufmann.

Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. In Proceedings of the World Wide Web Conference, pages 83–92, Edinburgh, Scotland.

Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 81–90, New York, NY, USA. ACM Press.
