Outline Motivation Results Conclusions

Link Analysis in National Web Domains

Ricardo Baeza-Yates and Carlos Castillo

ICREA / Cátedra Telefónica, Universitat Pompeu Fabra
http://www.upf.edu/dtecn/

OSWIR 2005
Compiegne, France
September 19, 2005

Ricardo Baeza-Yates and Carlos Castillo Universitat Pompeu Fabra - Barcelona, Spain

Link Analysis in National Web Domains http://www.upf.edu/dtecn/

1 Motivation

2 Results

3 Conclusions

Motivation

Sampling the Web

✗ We don’t have access to a global-scale collection

✗ A set of Web sites in the same organization is not diverse enough

✗ A set of Web sites in the same topic might not be representative

✗ A set of random Web sites might not be connected

✓ A national domain has a good balance between diversity and completeness

Collections used

✓ Different economic, historical, linguistic and geographical contexts

Collection    Year   Available hosts     Pages
                     [mill]   (rank)     [mill]
Brazil        2005   3.9      11th        4.7
Chile         2004   0.3      42nd        3.3
Greece        2004   0.3      40th        3.7
Indochina     2004   0.5      38th        7.4
Italy         2004   9.3      4th        41.3
South Korea   2004   0.2      47th        8.9
Spain         2004   1.3      25th       16.2
U.K.          2002   4.4      10th       18.5

Scale-free topology

If we sort pages by the number of in-links, the $k$-th page has in-degree proportional to $k^{-\alpha}$ (Zipf’s law).

Equivalently, the fraction of pages with $x$ in-links is proportional to $x^{-\theta}$ (power law). Experimentally, $\theta \approx 2.1$ on the Web.

Partial explanation: a multiplicative process; if $d_t$ is the number of links at time $t$, then $d_{t+1} = C \times d_t$.
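The link between the Zipf exponent and the power-law exponent can be made explicit. What follows is a sketch of the standard rank-frequency argument, not something taken from the slides; the constant $c$ and the page count $N$ are notation introduced here for illustration.

```latex
% Assumption: the k-th ranked page has in-degree x_k = c k^{-\alpha} (Zipf's law).
% The number of pages with at least x in-links is then the rank k(x):
\[
  k(x) = \left(\frac{x}{c}\right)^{-1/\alpha}
\]
% With N pages in total, the fraction of pages with x in-links is,
% up to a constant, the negative derivative of k(x):
\[
  p(x) \;\propto\; -\frac{1}{N}\,\frac{dk}{dx} \;\propto\; x^{-(1 + 1/\alpha)},
  \qquad\text{so}\qquad
  \theta = 1 + \frac{1}{\alpha}.
\]
% Example: \theta \approx 2.1 corresponds to \alpha = 1/(\theta - 1) \approx 0.91.
```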

In-degree

[Figure: in-degree distributions on log-log axes, one panel each for Brazil, Chile, Greece, Italy, Korea, Spain and U.K.; horizontal axis from 10^0 to 10^4, vertical axis from 10^-7 to 10^-1.]

Out-degree

[Figure: out-degree distributions on log-log axes, one panel each for Brazil, Chile, Greece, Italy, Korea, Spain and U.K.; horizontal axis from 10^0 to 10^3, vertical axis from 10^-6 to 10^-1.]
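In-degree and out-degree distributions like the ones plotted for each collection can be computed directly from a list of links. The following is a minimal sketch under stated assumptions: the function name `degree_distribution`, its parameters and the toy edge list are hypothetical and are not taken from the talk.

```python
from collections import Counter

def degree_distribution(edges, num_pages, direction="in"):
    """Return {degree: fraction of pages with that degree}.

    edges: iterable of (source, target) page identifiers.
    num_pages: total number of pages, including pages with no links.
    direction: "in" counts links received, "out" counts links emitted.
    """
    index = 1 if direction == "in" else 0
    per_page = Counter(edge[index] for edge in edges)
    histogram = Counter(per_page.values())
    # Pages that never appear on the chosen side of an edge have degree 0.
    histogram[0] = num_pages - len(per_page)
    return {d: n / num_pages for d, n in sorted(histogram.items())}

# Toy graph with 4 pages (hypothetical data, not from the collections).
toy_edges = [("a", "b"), ("c", "b"), ("d", "b"), ("a", "c")]
in_dist = degree_distribution(toy_edges, num_pages=4, direction="in")
out_dist = degree_distribution(toy_edges, num_pages=4, direction="out")
print(in_dist)
print(out_dist)
```

Plotting the resulting fractions against the degree on log-log axes gives a curve of the same kind as the panels in these figures.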

Link scores (PageRank, Hubs, Authorities)

[Figure: distributions of link scores (PageRank, hubs, authorities) on log-log axes, in three rows of panels; each row has one panel each for Brazil, Chile, Greece and Korea.]

Power-law exponents

Collection    In-degree   Out-degree        PageRank   HITS
                          Small    Large               Hubs   Auth.
Brazil        1.9         0.7      2.7      1.8        2.9    1.8
Chile         2.0         0.7      2.6      1.9        2.7    1.9
Greece        1.9         0.6      1.9      1.8        2.6    1.8
Indochina     1.6         0.7      2.6
Italy         1.8         0.7      2.5
South Korea   1.9         0.3      2.0      1.8        3.7    1.8
Spain         2.1         0.9      4.2      2.0
U.K.          1.8         0.7      3.4

Values reported in earlier work:
(Broder... 2000)        in-degree 2.1, out-degree 2.7
(Dill... 2002)          in-degree 2.1, out-degree 2.2
(Pandurangan... 2002)   PageRank 2.1
(Kleinberg... 1999)     in-degree ≈ 2
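The slides do not say how the exponents were fitted. One common way to estimate a power-law exponent is the continuous maximum-likelihood estimator, $\hat{\theta} = 1 + n / \sum_i \ln(x_i / x_{\min})$. The sketch below applies it to synthetic data only; the function names, the cut-off `x_min` and the sample size are assumptions made for illustration, and this is not presented as the authors' procedure.

```python
import math
import random

def sample_power_law(theta, x_min, n, rng):
    """Draw n samples from a continuous power law p(x) ~ x**(-theta), x >= x_min."""
    # Inverse-transform sampling of the Pareto distribution.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (theta - 1.0)) for _ in range(n)]

def fit_exponent(samples, x_min):
    """Maximum-likelihood estimate of theta for the tail x >= x_min."""
    tail = [x for x in samples if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)

# Synthetic data with a known exponent (hypothetical, not from the collections).
rng = random.Random(0)
synthetic = sample_power_law(theta=2.1, x_min=1.0, n=100_000, rng=rng)
theta_hat = fit_exponent(synthetic, x_min=1.0)
print(round(theta_hat, 2))
```

On real degree data the choice of `x_min` matters, because only the tail of the distribution is expected to follow a power law; the separate "Small" and "Large" out-degree columns reflect that two regimes were fitted.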

Hostgraph

[Figure: three example hosts, www.example1.com, www.example2.com and www.example3.com, shown as hostgraph nodes S1, S2 and S3.]
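A hostgraph is usually obtained by collapsing all pages of a host into a single node and keeping a link between two hosts whenever some page on one links to some page on the other. The following is a minimal sketch of that construction, assuming links are given as pairs of URLs; `build_hostgraph` and the sample links are hypothetical, and dropping links within a host is a choice made here, not something stated on the slide.

```python
from urllib.parse import urlsplit

def build_hostgraph(page_links):
    """Collapse page-level links into a set of (source_host, target_host) edges."""
    host_edges = set()
    for source_url, target_url in page_links:
        source_host = urlsplit(source_url).hostname
        target_host = urlsplit(target_url).hostname
        # Links inside one host do not create an edge (an assumption of this sketch).
        if source_host and target_host and source_host != target_host:
            host_edges.add((source_host, target_host))
    return host_edges

# Hypothetical page-level links between the three example hosts.
page_links = [
    ("http://www.example1.com/a.html", "http://www.example1.com/b.html"),
    ("http://www.example1.com/a.html", "http://www.example2.com/"),
    ("http://www.example1.com/b.html", "http://www.example2.com/x.html"),
    ("http://www.example3.com/", "http://www.example1.com/a.html"),
]
hostgraph = build_hostgraph(page_links)
print(sorted(hostgraph))
```

Using a set means that many page-level links between the same pair of hosts count as one host-level edge, so host degrees count distinct neighbouring hosts rather than individual links.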

The hostgraph also exhibits a power law

Collection          Hostgraph degree
                    In        Out
Brazil              1.9       1.9
Chile               2.0       1.7
Greece              2.0       1.6
South Korea         1.2       1.4
Spain               1.8       1.3
(Bharat... 2001)    1.6-1.7   1.7-1.8
(Dill... 2002)      2.3

Web structure: connected components

“Normal” vs “Giant” strongly connected components

[Figure: distributions of strongly connected component sizes on log-log axes, one panel each for Brazil, Chile, Greece, Korea and Spain; horizontal axis from 10^0 to 10^5, vertical axis from 10^-6 to 10^0.]
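The distinction between "normal" and "giant" strongly connected components rests on computing the size of every component. The sketch below does this with Kosaraju's two-pass algorithm, written iteratively so that it does not hit Python's recursion limit on large graphs; the function name `scc_sizes` and the toy graph are hypothetical, and the slides do not say which algorithm the authors used.

```python
from collections import Counter, defaultdict

def scc_sizes(edges):
    """Return the sizes of all strongly connected components, largest first."""
    forward, backward, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        forward[u].append(v)
        backward[v].append(u)
        nodes.update((u, v))

    # Pass 1: iterative DFS on the forward graph, recording finish order.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(forward[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(forward[nxt])))
                    break
            else:
                # All neighbours explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph in decreasing finish time.
    assigned, sizes = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for prev in backward[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        sizes.append(size)
    return sorted(sizes, reverse=True)

# Toy graph (hypothetical): components {1, 2, 3}, {4, 5} and {6}.
toy_edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4), (5, 6)]
sizes = scc_sizes(toy_edges)
size_histogram = Counter(sizes)  # component size -> number of components
print(sizes, dict(size_histogram))
```

The first entry of `sizes` is the largest component, and normalising `size_histogram` by the number of components gives a size distribution that can be plotted on log-log axes.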

Conclusions

✓ Consistent results across collections

✓ Differences in the amount of spam

✓ Comparison of other aspects [to be available soon]

Thank you
