What’s New on the Web? The Evolution of the Web from a Search Engine Perspective
Alexandros Ntoulas (1), Junghoo Cho (1), Christopher Olston (2)
(1) University of California, Los Angeles, {ntoulas, cho}@cs.ucla.edu
(2) Carnegie Mellon [email protected]
World Wide Web Conference, New York, 18th May 2004
What’s New on the Web? A. Ntoulas, J. Cho, C. Olston 1/25
Introduction
Motivation
Search engines crawl the Web to build local indexes
The Web is constantly evolving: pages appear, disappear, change
Search engines need to update their index to keep up with the evolving Web
How does the Web evolution affect search engines?
Outline
Experimental Setup
What’s new on the Web?
  Birth, death, replacement of pages
  Creation of new content
  Link-structure evolution
How much do persisting pages change?
  Frequency and degree of change
Can we predict the changes?
Experimental Setup
Data selection and crawling
Picked 5 top-ranked sites from a subset of Google Directory’s topical categories
Crawled pages from 154 sites every week from Oct. 2002 until Oct. 2003
Crawled in breadth-first manner until we either:
  downloaded all pages from a site, or
  reached a limit of 200,000 pages (only 4 such sites)
Considered a page unavailable after 3 unsuccessful attempts to download it
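The crawl policy above can be sketched roughly as follows. This is a minimal illustration, not the crawler used in the study: `crawl_site`, `fetch`, and the toy site are hypothetical names, and `fetch` is assumed to return only same-site links.

```python
from collections import deque

MAX_PAGES = 200_000   # per-site page limit stated on the slide
MAX_ATTEMPTS = 3      # failed attempts before a page counts as unavailable


def crawl_site(root, fetch, max_pages=MAX_PAGES, max_attempts=MAX_ATTEMPTS):
    """Breadth-first crawl of one site.

    `fetch(url)` is a hypothetical callable that returns
    (content, same_site_links) or raises OSError on a failed download.
    Returns (pages, unavailable).
    """
    pages = {}            # url -> content, in download order
    unavailable = set()   # urls that failed max_attempts times
    seen = {root}
    queue = deque([root])
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        result = None
        for _ in range(max_attempts):
            try:
                result = fetch(url)
                break
            except OSError:
                continue
        if result is None:
            unavailable.add(url)
            continue
        content, links = result
        pages[url] = content
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages, unavailable


# Toy in-memory "site" standing in for real downloads (illustration only).
SITE = {
    "/": ("home", ["/a", "/b"]),
    "/a": ("page a", ["/b", "/c"]),
    "/b": ("page b", ["/missing"]),
    "/c": ("page c", []),
}


def toy_fetch(url):
    if url not in SITE:
        raise OSError("download failed")
    return SITE[url]


pages, unavailable = crawl_site("/", toy_fetch)
```

Using a FIFO queue gives the breadth-first order; the loop stops either when the queue is empty (all pages downloaded) or when the page limit is reached.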
Data characteristics
Crawled 4.4 million pages on average every week
Weekly crawl size: ≈ 65GB
Total crawl size: ≈ 3.3TB
Meta-data derived from the crawls (links, shingles, etc.): ≈ 4TB
Distribution of pages per domain
.com 41%
.gov 18.7%
.edu 16.5%
.org 15.7%
.net 4.1%
.mil 2.9%
miscellaneous 2.9%
What’s new on the Web?
Weekly birth rate of pages
[Figure: fraction of new pages (y-axis) per week (x-axis, weeks 2 to 50)]
Average weekly birth rate ≈ 8%
A lot of new pages appear at the end of a calendar month
Birth, death and replacement over time
[Figure: two plots of fraction of pages (y-axis) vs. week (x-axis, weeks 1 to 50)]
Total number of pages almost constant
Half-life of the pages is about 9 months
Could not find a good fit for the data
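Since no fitted model is given, a half-life can still be read off a survival curve empirically. The sketch below is one possible way to do that, assuming a list `survival` whose entry t is the fraction of the initially crawled pages still present in week t; the function name and the toy numbers are illustrative, not the study's data.

```python
def half_life(survival):
    """Return the interpolated week at which the surviving fraction
    first drops to 0.5, or None if it never does.

    survival[t] = fraction of week-0 pages still present in week t.
    """
    for t in range(1, len(survival)):
        prev, cur = survival[t - 1], survival[t]
        if cur <= 0.5 < prev:
            # Linear interpolation between weeks t-1 and t.
            return (t - 1) + (prev - 0.5) / (prev - cur)
    return None


# Made-up survival curve, for illustration only.
toy_survival = [1.0, 0.75, 0.25]
print(half_life(toy_survival))
```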
Creation of new content
We discovered the new pages, but how much content is actually new?
We used the shingling technique [FMNW03]
  Exclude HTML from the pages
  Extract the w-shingles from the pages (w=50)
  Compare how many shingles exist/disappear over time
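The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the regex-based tag stripping, the handling of pages shorter than w words, and the names `strip_html`, `shingles`, and `new_fraction` are all choices made here, not details from the study.

```python
import re


def strip_html(page):
    """Crude tag removal; a real pipeline would use a proper HTML parser."""
    return re.sub(r"<[^>]*>", " ", page)


def shingles(page, w=50):
    """Set of all contiguous w-word sequences (w-shingles) in the page text."""
    words = strip_html(page).split()
    if len(words) < w:
        # Assumption: a page shorter than w words forms a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def new_fraction(old_pages, new_pages, w=50):
    """Fraction of shingles in new_pages that do not occur in old_pages."""
    old = set().union(*(shingles(p, w) for p in old_pages))
    new = set().union(*(shingles(p, w) for p in new_pages))
    return len(new - old) / len(new) if new else 0.0


# Toy snapshots with w=3 so the effect is visible on short text.
week1 = ["<p>a b c d e</p>"]
week2 = ["<p>a b c d x</p>"]
print(new_fraction(week1, week2, w=3))  # one of three shingles is new
```

Comparing shingle sets rather than whole pages is what lets a new page that mostly copies existing text count as only partly new.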
Evolution of content over time
[Figure: two plots of fraction of shingles (y-axis) vs. week (x-axis, weeks 1 to 50)]
Shingles are replaced more slowly than pages
About 5% of content is new every week
About 62% of the content in new pages is actually new
Evolution of the link structure
[Figure: two plots of fraction of links (y-axis) vs. week (x-axis, weeks 1 to 50)]
Link structure is significantly more dynamic than pages
About 25% of links are new every week
How much change?
Degree of change
Search engines care about degree, not presence of change
We measured degree of change using two metrics
  TF.IDF Cosine Distance
    $D_{\cos}(p_1, p_2) = 1 - \dfrac{v_1 \cdot v_2}{\|v_1\|_2 \, \|v_2\|_2}$
  Word Distance
    $D_{\mathrm{word}}(p_1, p_2) = 1 - \dfrac{2 \cdot |\text{common words}|}{|\text{words in } p_1| + |\text{words in } p_2|}$
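A small sketch of both metrics, under assumptions that the slides do not spell out: words are split on whitespace and counted as a multiset, and the IDF weights come from a caller-supplied `idf` dictionary (hypothetical; words missing from it get weight 1.0).

```python
import math
from collections import Counter


def word_distance(p1, p2):
    """D_word = 1 - 2*|common words| / (|words in p1| + |words in p2|).

    Assumption: words are counted as a multiset (bag), so a repeated
    word counts each time it occurs.
    """
    c1, c2 = Counter(p1.split()), Counter(p2.split())
    common = sum((c1 & c2).values())
    total = sum(c1.values()) + sum(c2.values())
    return 1.0 - 2.0 * common / total if total else 0.0


def cosine_distance(p1, p2, idf):
    """D_cos = 1 - (v1 . v2) / (||v1||_2 * ||v2||_2) over TF.IDF vectors.

    `idf` maps word -> inverse document frequency (hypothetical input).
    """
    def vec(p):
        return {w: tf * idf.get(w, 1.0) for w, tf in Counter(p.split()).items()}

    v1, v2 = vec(p1), vec(p2)
    dot = sum(x * v2.get(w, 0.0) for w, x in v1.items())
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return 1.0 - dot / (n1 * n2) if n1 and n2 else 0.0


# Toy pages: half of the words are shared.
old, new = "a b c d", "a b x y"
print(word_distance(old, new))
print(cosine_distance(old, new, idf={}))
```

Because the cosine metric weights each word by its IDF, a change confined to common, low-weight words moves D_cos less than it moves D_word.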
Degree of change and frequency of change
[Figure, left: degree of change (y-axis) vs. no. of changes (x-axis, 10 to 50), for Dword and Dcos]
[Figure, right: cumulative degree of change (y-axis) vs. no. of changes (x-axis, 10 to 50), for Dword and Dcos]
No correlation between frequency and degree of change
The same portion of pages changes repeatedly
Can we predict the change?
Can we predict the changes?
All Web sites
[Figure: three panels (One Week, One Month, One Quarter); legend: Group A (Top 80%), Group B (Top 90%), Group C (Top 95%), Group D (Remainder)]
Most pages have highly predictable change patterns
Predictability decreases with longer intervals
Predictability can vary per Web site
Conclusions
Existing pages on the Web are replaced at a high rate
New pages “borrow” content from existing ones
Link structure changes faster than pages
Pages that persist demonstrate minor changes
The past degree of change is a good predictor for future degree of change
Related work
Others have studied Web evolution [FMNW03, LWP+01, BC00].
Theoretical work on Web graph evolution [BKM+00, CDK+99, KRR+00]
Our study:
  spans a longer period
  focuses on change metrics that are important for search engines
  studies link evolution
  examines predictability of changes
References
[BC00] B. E. Brewington and G. Cybenko. How dynamic is the web? In Proceedings of the Ninth International World Wide Web Conference, Amsterdam, The Netherlands, May 2000.
[BKM+00] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. In Proceedings of the Ninth International World Wide Web Conference, Amsterdam, The Netherlands, May 2000.
[CDK+99] S. Chakrabarti, B. E. Dom, S. Ravi Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. Kleinberg. Mining the Web’s link structure. Computer, 32(8):60–67, 1999.
[FMNW03] D. Fetterly, M. Manasse, M. Najork, and J. L. Wiener. A large-scale study of the evolution of web pages. In Proceedings of the Twelfth International World Wide Web Conference, Budapest, Hungary, May 2003.
[KRR+00] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. In IEEE Symposium on Foundations of Computer Science (FOCS), 2000.
[LWP+01] L. Lim, M. Wang, S. Padmanabhan, J. Scott Vitter, and R. C. Agarwal. Characterizing web document change. In Proceedings of the Second International Conference on Advances in Web-Age Information Management, pages 133–144. Springer-Verlag, 2001.
Thank you
Questions?
Backup slides
Can we predict the changes?
Web site www.eonline.com
[Figure: three panels (One Week, One Month, One Quarter); legend: Group A (Top 80%), Group B (Top 90%), Group C (Top 95%), Group D (Remainder)]
Less predictable than overall pages
The ability to predict can vary per Web site
Degree of change and frequency of change
99% confidence interval
[Figure: degree of change (y-axis) vs. no. of changes (x-axis, 10 to 50)]
Degree of change
TF.IDF Cosine Distance
[Figure: fraction of changes (y-axis) vs. amount of change in Dcos (x-axis, 0.05 to 1)]
Word Distance
[Figure: fraction of changes (y-axis) vs. amount of change in Dword (x-axis, 0.05 to 1)]
Most of the changes are minor under both metrics
A significant number of changes are due to low-weight words
Frequency of change
[Figure: fraction of pages (y-axis) vs. average change interval (x-axis bins: 0, 2, 4, 6, 8, 10, 12, 18, 25, 50, inf)]
Measured change frequency based on “simple” definition of change
Most pages either change often or not at all
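One way to compute an average change interval from weekly snapshots is sketched below. It assumes that the "simple" definition means any difference between consecutive snapshots counts as a change, and that each snapshot is summarized by a checksum; both are assumptions made for illustration, as is the function name.

```python
def avg_change_interval(checksums):
    """Average number of weeks between detected changes for one page.

    checksums[t] is a checksum of the page in weekly snapshot t.
    Assumption: any difference between consecutive snapshots counts
    as one change. Returns float('inf') if the page never changed.
    """
    changes = sum(1 for a, b in zip(checksums, checksums[1:]) if a != b)
    intervals = len(checksums) - 1
    return intervals / changes if changes else float("inf")


# Toy checksum histories (illustration only).
print(avg_change_interval(["a", "b", "c"]))            # changes every week
print(avg_change_interval(["a", "a", "b", "b", "c"]))  # changes every 2 weeks
print(avg_change_interval(["x", "x", "x"]))            # never changes
```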