What’s New on the Web?
The Evolution of the Web from a Search Engine Perspective

Alexandros Ntoulas¹ Junghoo Cho¹ Christopher Olston²

¹ University of California Los Angeles, {ntoulas, cho}@cs.ucla.edu
² Carnegie Mellon University, [email protected]

World Wide Web Conference, New York, 18th May 2004



Introduction

Motivation

Search engines crawl the Web to build local indexes

The Web is constantly evolving: pages appear, disappear, change

Search engines need to update their index to keep up with the evolving Web

How does the Web evolution affect search engines?


Introduction

Outline

Experimental Setup

What’s new on the Web?
  • Birth, death, replacement of pages
  • Creation of new content
  • Link-structure evolution

How much do persisting pages change?
  • Frequency and degree of change

Can we predict the changes?


Experimental Setup

Data selection and crawling

Picked 5 top-ranked sites from a subset of Google Directory’s topical categories

Crawled pages from 154 sites every week from Oct. 2002 until Oct. 2003

Crawled in breadth-first manner until we either:
  • downloaded all pages from a site, or
  • reached a limit of 200,000 pages (only 4 such sites)

Considered a page unavailable after 3 unsuccessful attempts to download it
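
The crawl policy above can be sketched roughly as follows. This is a minimal illustration and not the crawler used in the study: the `fetch` callback, its return convention (a list of same-site links, or `None` on a failed download), and the function name `crawl_site` are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Iterable, Optional

# Hypothetical callback: fetch(url) returns the same-site links found on the
# page, or None if the download failed.
Fetch = Callable[[str], Optional[Iterable[str]]]

def crawl_site(root: str, fetch: Fetch, page_limit: int = 200_000,
               max_attempts: int = 3):
    """Breadth-first crawl of one site.

    Returns (downloaded, unavailable): URLs fetched successfully, and URLs
    given up on after max_attempts failed downloads.
    """
    queue = deque([root])
    seen = {root}
    downloaded, unavailable = [], []
    while queue and len(downloaded) < page_limit:
        url = queue.popleft()
        links = None
        for _ in range(max_attempts):
            links = fetch(url)
            if links is not None:
                break
        if links is None:
            # Three failed attempts: treat the page as unavailable.
            unavailable.append(url)
            continue
        downloaded.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return downloaded, unavailable
```

The loop stops either when the queue is empty (the whole site was downloaded) or when the page limit is reached, mirroring the two stopping conditions on the slide.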


Experimental Setup

Data characteristics

Crawled 4.4 million pages on average every week

Weekly crawl size: ≈ 65GB

Total crawl size: ≈ 3.3TB

Meta-data derived from the crawls (links, shingles etc.): ≈ 4TB


Experimental Setup

Distribution of pages per domain

.com 41%

.gov 18.7%

.edu 16.5%

.org 15.7%

.net 4.1%

.mil 2.9%

miscellaneous 2.9%


What’s new on the Web?

Weekly birth rate of pages

[Figure: fraction of new pages (y-axis, 0.02 to 0.14) for each week of the crawl (x-axis, weeks 2 to 50)]

Average weekly birth rate ≈ 8%

A lot of new pages appear at the end of a calendar month
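
One way a weekly birth rate could be computed from the crawls is sketched below. The slide does not define "new" precisely, so this example assumes a page is new in a given week if its URL did not appear in any earlier crawl; the function name and the set-of-URLs input format are hypothetical.

```python
def weekly_birth_rates(snapshots):
    """snapshots: list of sets of URLs, one set per weekly crawl.

    Returns, for every week after the first, the fraction of that week's
    pages whose URL was not seen in any earlier crawl (an assumed
    definition of "new").
    """
    seen = set(snapshots[0])
    rates = []
    for pages in snapshots[1:]:
        new_pages = pages - seen
        rates.append(len(new_pages) / len(pages) if pages else 0.0)
        seen |= pages
    return rates
```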


What’s new on the Web?

Birth, death and replacement over time

[Figure, two panels: fraction of pages (y-axis, 0.2 to 1) by week (x-axis, weeks 1 to 50)]

Total number of pages almost constant

Half-life of the pages is about 9 months

Could not find a good fit for the data
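
A half-life like the one quoted above can be read off the weekly crawls by tracking how many of the first week's pages are still present. The sketch below is an illustration under assumptions: each crawl is a set of URLs, the first crawl is non-empty, and the half-life is taken as the first week in which at most half of the original pages remain. The function names are hypothetical.

```python
def surviving_fraction(snapshots):
    """Fraction of the first crawl's URLs still present in each weekly crawl."""
    first = snapshots[0]
    return [len(first & pages) / len(first) for pages in snapshots]

def half_life_weeks(snapshots):
    """First week index at which at most half of the original pages remain,
    or None if that never happens within the observed crawls."""
    for week, fraction in enumerate(surviving_fraction(snapshots)):
        if fraction <= 0.5:
            return week
    return None
```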


What’s new on the Web?

Creation of new content

We discovered the new pages, but how much content is actually new?

We used the shingling technique [FMNW03]
  • Exclude HTML from the pages
  • Extract the w-shingles from the pages (w=50)
  • Compare how many shingles exist/disappear over time
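
The three steps above can be sketched as follows. This is an illustrative reading, not the study's implementation: the regex-based tag stripping, the whitespace tokenization, the SHA-1 hashing of shingles, and the handling of pages shorter than w words are all assumptions, and the function names are hypothetical.

```python
import hashlib
import re

def strip_html(html: str) -> str:
    """Crude tag removal; a real crawler would use a proper HTML parser."""
    return re.sub(r"<[^>]+>", " ", html)

def _digest(words) -> str:
    # Assumption: shingles are stored as hashes to save space.
    return hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()

def shingles(text: str, w: int = 50) -> set:
    """Set of hashed w-shingles (every run of w consecutive words)."""
    words = text.split()
    if len(words) < w:
        # Assumption: a page shorter than w words yields a single shingle.
        return {_digest(words)} if words else set()
    return {_digest(words[i:i + w]) for i in range(len(words) - w + 1)}

def new_shingle_fraction(old_pages, new_pages, w: int = 50) -> float:
    """Fraction of shingles in new_pages that occur nowhere in old_pages."""
    old = set().union(*(shingles(strip_html(p), w) for p in old_pages))
    new = set().union(*(shingles(strip_html(p), w) for p in new_pages))
    return len(new - old) / len(new) if new else 0.0
```

Comparing the shingle sets of two crawls, rather than the pages themselves, is what lets copied text on a newly created page be counted as old content.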


What’s new on the Web?

Evolution of content over time

[Figure, two panels: fraction of shingles (y-axis) by week (x-axis, weeks 1 to 50)]

Shingles are replaced more slowly than pages

About 5% of content is new every week

About 62% of the content in new pages is actually new


What’s new on the Web?

Evolution of the link structure

[Figure, two panels: fraction of links (y-axis) by week (x-axis, weeks 1 to 50)]

Link structure is significantly more dynamic than pages

About 25% of links are new every week


How much change?

Degree of change

Search engines care about degree, not presence of change

We measured degree of change using two metrics

TF.IDF Cosine Distance

$$D_{\cos}(p_1, p_2) = 1 - \frac{v_1 \cdot v_2}{\|v_1\|_2 \, \|v_2\|_2}$$

Word Distance

$$D_{\text{word}}(p_1, p_2) = 1 - \frac{2 \cdot |\text{common words}|}{|\text{words in } p_1| + |\text{words in } p_2|}$$
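
Both metrics are short to compute once a page is reduced to a list of words. The sketch below is a hedged illustration: it assumes v1 and v2 are TF.IDF vectors built from raw term counts, takes the IDF weights from a hypothetical `idf` dictionary (unknown words default to weight 1.0), and counts common words as a multiset intersection. None of these choices are stated on the slide.

```python
import math
from collections import Counter

def tfidf_vector(words, idf):
    """Map each word to its term frequency times its IDF weight.

    idf is a hypothetical dict word -> weight; unknown words get 1.0.
    """
    tf = Counter(words)
    return {word: count * idf.get(word, 1.0) for word, count in tf.items()}

def cosine_distance(words1, words2, idf):
    """D_cos: one minus the cosine of the angle between the TF.IDF vectors."""
    v1, v2 = tfidf_vector(words1, idf), tfidf_vector(words2, idf)
    dot = sum(weight * v2.get(word, 0.0) for word, weight in v1.items())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption: two empty pages are identical; empty vs non-empty differ fully.
        return 0.0 if norm1 == norm2 else 1.0
    return 1.0 - dot / (norm1 * norm2)

def word_distance(words1, words2):
    """D_word: one minus twice the common words over the total word count."""
    total = len(words1) + len(words2)
    if total == 0:
        return 0.0
    # Assumption: common words are counted with multiplicity.
    common = sum((Counter(words1) & Counter(words2)).values())
    return 1.0 - 2.0 * common / total
```

Because D_cos weights each word by its IDF, replacing a low-weight word moves it far less than replacing a rare one, whereas D_word treats every word equally.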


How much change?

Degree of change and frequency of change

[Figure, two panels: degree of change and cumulative degree of change (y-axes) versus number of changes (x-axis, 10 to 50), each plotted for Dword and Dcos]

No correlation between frequency and degree of change

The same portion of pages changes repeatedly


Can we predict the change?

Can we predict the changes?

All Web sites

[Figure, three panels: One Week, One Month, One Quarter; legend: Group A (Top 80%), Group B (Top 90%), Group C (Top 95%), Group D (Remainder)]

Most pages have highly predictable change patterns

Predictability decreases with longer intervals

Predictability can vary per Web site


Conclusions

Existing pages on the Web are replaced at a high rate

New pages “borrow” content from existing ones

Link structure changes faster than pages

Pages that persist demonstrate minor changes

The past degree of change is a good predictor for future degree of change


Related work

Others have studied Web evolution [FMNW03, LWP+01, BC00].

Theoretical work on the Web graph evolution [BKM+00, CDK+99, KRR+00]

Our study:
  • spans over a longer period
  • focuses on change metrics which are important for search engines
  • studies link evolution
  • examines predictability of changes


References

[BC00] B. E. Brewington and G. Cybenko. How dynamic is the web? In Proceedings of the Ninth International World Wide Web Conference, Amsterdam, The Netherlands, May 2000.

[BKM+00] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. In Proceedings of the Ninth International World Wide Web Conference, Amsterdam, The Netherlands, May 2000.

[CDK+99] S. Chakrabarti, B. E. Dom, S. Ravi Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. Kleinberg. Mining the Web’s link structure. Computer, 32(8):60–67, 1999.

[FMNW03] D. Fetterly, M. Manasse, M. Najork, and J. L. Wiener. A large-scale study of the evolution of web pages. In Proceedings of the Twelfth International World Wide Web Conference, Budapest, Hungary, May 2003.

[KRR+00] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. In IEEE Symposium on Foundations of Computer Science (FOCS), 2000.

[LWP+01] L. Lim, M. Wang, S. Padmanabhan, J. Scott Vitter, and R. C. Agarwal. Characterizing web document change. In Proceedings of the Second International Conference on Advances in Web-Age Information Management, pages 133–144. Springer-Verlag, 2001.


Thank you

Questions?


Backup slides

Can we predict the changes?

Web site www.eonline.com

[Figure, three panels: One Week, One Month, One Quarter; legend: Group A (Top 80%), Group B (Top 90%), Group C (Top 95%), Group D (Remainder)]

Less predictable than overall pages

The ability to predict can vary per Web site


Backup slides

Degree of change and frequency of change

[Figure: degree of change (y-axis, 0.05 to 0.2) versus number of changes (x-axis, 10 to 50), with 99% confidence interval]


Backup slides

Degree of change

[Figure, two panels: TF.IDF Cosine Distance and Word Distance; each shows the fraction of changes (y-axis, 0.2 to 1) by amount of change (x-axis, Dcos or Dword, .05 to 1)]

Most of the changes are minor under both metrics

A significant number of changes is due to low-weight words


Backup slides

Frequency of change

[Figure: fraction of pages (y-axis, 0.1 to 0.5) by average change interval (x-axis, 0 to 50 and inf)]

Measured change frequency based on “simple” definition of change

Most pages either change often or not at all
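
An average change interval per page could be derived from the weekly crawls as in the sketch below. The slide does not spell out the “simple” definition, so the example assumes it means any difference in page content between consecutive weekly downloads, detected by comparing checksums. The function name and the SHA-256 checksum are illustrative choices.

```python
import hashlib
import math

def average_change_interval(weekly_versions):
    """weekly_versions: contents (str) of one page, one entry per weekly crawl.

    Counts a change whenever the checksum differs from the previous week's
    (an assumed reading of the "simple" definition) and returns the number
    of observed weekly intervals divided by the number of changes, or
    math.inf if the page never changed.
    """
    digests = [hashlib.sha256(v.encode("utf-8")).hexdigest()
               for v in weekly_versions]
    changes = sum(1 for prev, cur in zip(digests, digests[1:]) if prev != cur)
    intervals = len(digests) - 1
    return intervals / changes if changes else math.inf
```

Pages that never change land in the "inf" bucket, which corresponds to the rightmost tick on the figure's x-axis.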
