Sitemaps: Above and Beyond the Crawl of Duty
Uri Schonfeld (Google and UCLA), Narayanan Shivakumar (Google)



Abstract

Comprehensive coverage of the public web is crucial to web search engines. Search engines use crawlers to retrieve pages and then discover new ones by extracting the pages’ outgoing links. However, the set of pages reachable from the publicly linked web is estimated to be significantly smaller than the invisible web [5], the set of documents that have no incoming links and can only be retrieved through web applications and web forms. The Sitemaps protocol is a fast-growing web protocol supported jointly by major search engines to help content creators and search engines unlock this hidden data by making it available to search engines. In this paper, we perform a detailed study of how “classic” discovery crawling compares with Sitemaps, in key measures such as coverage and freshness over key representative websites as well as over billions of URLs seen at Google. We observe that Sitemaps and discovery crawling complement each other very well, and offer different tradeoffs.

Categories and Subject Descriptors: H.3.3: Information Search and Retrieval.
General Terms: Experimentation, Algorithms.
Keywords: search engines, crawling, sitemaps, metrics, quality.


Page 1

Sitemaps: Above and Beyond the Crawl of Duty

Sitemaps! Sitemaps!

Uri Schonfeld (Google and UCLA), Narayanan Shivakumar (Google)

Copyright Uri Schonfeld, shuri.org April 2009

Page 2

What are we going to talk about?

• The Sitemaps protocol:
  – Not introduced in this paper
  – Friendly web servers publishing URL lists

• Widely adopted and growing in popularity

• First large-scale study over real data:
  – How it is used by users
  – Its impact
  – First look at how it can be used by search engines
  – Lots of future work to get excited over

• Let’s start with:
  – The underlying problem that Sitemaps addresses

Page 3

Dream of the Perfect Crawl

1. Users have high expectations:
   • Coverage: every page should be findable
   • Freshness: latest event, viral video, ...
   • Deep web: AJAX, Flash, Silverlight, ...

2. Search engines dream of the perfect crawl:
   • Everything the users want
   • ... but efficient:
     – No 404s
     – No duplicates

3. Sitemaps to the rescue...

Page 4

Sitemaps

1. Basic idea: the web server
   1. Puts a URL list, a Sitemaps file, on its site
   2. Includes new and changed content
   3. Lets the search engines know

2. The URL list may also include, for each URL:
   • Last modification time
   • Expected change frequency
   • Priority

3. Let the search engines know (example sketch below):
   1. "Ping" search engines that the site's Sitemaps file has changed
   2. Alternatively, include the Sitemaps location in the robots.txt file (April 2007)
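As a concrete illustration of these two notification mechanisms, here is a minimal Python sketch: it URL-encodes the Sitemaps location into a search engine's ping endpoint and shows the robots.txt line that announces the file. The ping URL, the example.com addresses, and the endpoint's continued availability are assumptions for illustration only, not something the slides specify.

import urllib.parse
import urllib.request

SITEMAP_URL = "http://www.example.com/sitemap.xml"

# Mechanism 1: "ping" a search engine to say the Sitemaps file changed.
# The endpoint below is illustrative of the ping URLs engines exposed
# around 2007-2009; it may no longer be supported.
ping_url = ("http://www.google.com/ping?sitemap="
            + urllib.parse.quote(SITEMAP_URL, safe=""))
with urllib.request.urlopen(ping_url) as resp:
    print("ping HTTP status:", resp.status)

# Mechanism 2 (April 2007): announce the file in robots.txt instead,
# by adding one line to http://www.example.com/robots.txt:
#
#   Sitemap: http://www.example.com/sitemap.xml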

Page 5

Sitemaps: This is how it looks

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
   <url>
      ...
   </url>
</urlset>
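For illustration only, a short Python sketch that emits a file in the format shown above from a list of entries; the helper name and entry fields are illustrative, not tooling defined by the protocol or the paper.

import xml.etree.ElementTree as ET

def write_sitemap(entries, path="sitemap.xml"):
    # `entries` is an iterable of dicts with key "loc" (required) and
    # optional "lastmod", "changefreq", "priority" keys.
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        for field in ("lastmod", "changefreq", "priority"):
            if field in entry:
                ET.SubElement(url, field).text = str(entry[field])
    ET.ElementTree(urlset).write(path, encoding="UTF-8", xml_declaration=True)

write_sitemap([{"loc": "http://www.example.com/",
                "lastmod": "2005-01-01",
                "changefreq": "monthly",
                "priority": 0.8}])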

Page 6

Related Work

1. 1999: "Santa Fe Convention"
   1. Led to OAI-PMH
   2. "... e-print servers to expose metadata for the papers it held"
   3. Coalition for Networked Information, Digital Library Federation, Open Archives Initiative (OAI), Herbert Van de Sompel, Carl Lagoze
2. 2000: Crawler Friendly Web Servers: Brandman, Cho, Garcia-Molina and Shivakumar
   1. Export a list of URLs and changed content
3. 2005/6: Sitemaps
   1. Introduced in 2005 by Google
   2. 2006: Microsoft, Yahoo! and Google announced joint support

Page 7

Our Main Contributions

1. First study of Sitemaps over real-world data:
   a) How it is used
   b) Its impact

2. Define metrics to evaluate Sitemaps feeds.
3. Explore:

a) The challenges of using Sitemaps together with Discovery Crawl

b) Define a preliminary algorithm combining the two crawls.

Page 8

Inside Google

1. Sitemaps & Discovery
2. Sitemaps:

a) Sitemaps are fetched:
   • After they are pinged.
   • At several frequencies.

b) Sitemaps-discovered URLs are fed to the crawling pipeline.
c) Some sources are fed directly for instant crawling.

3. Discovery:
   a) New URLs and URLs of changed content are fed back to the pipeline.
4. Pipeline

Page 9

How Is Sitemaps Used?

1. Approximately 35M websites publish Sitemaps, and give us metadata for several billion URLs.

2. Metadata:
   1. 61% include a priority field.
   2. 58% of URLs include a last modification date.
   3. 7% include a change frequency field.

3. Format breakdown:
   a) XML Sitemap: 76.76%
   b) URL list: 3.42%
   c) Atom: 1.61%
   d) RSS: 0.11%
   e) Unknown: 17.51%

4. Sitemaps in robots.txt: announced April 2007.

Page 10

Sitemaps Case Studies

Page 11

Sitemaps Use Case Studies

1. Looked at three different sites:
   a) Amazon: Large.
   b) CNN: Dynamic.
   c) Pubmedcentral.nih.gov: Archival.

2. Amazon:
   a) Huge.
   b) Service-oriented architecture:
      • Hard to list valid URLs when content changes
      • Research opportunity: auto-generation of Sitemaps

c) 20M URLs published in:
   • 10,000 sitemaps files.
   • Each file: 20,000-50,000 URLs.
   • Log-based.

d) Efficiency: URLs crawled vs. unique pages
   • Discovery 63%, Sitemaps 86%.

Page 12

Case Study: CNN

1. Very dynamic:
   a) Many new URLs added daily.

2. Sitemaps:
   a) News: 200-400 URLs
   b) Weekly: 2,500-3,000 URLs
   c) Monthly: 5,000-10,000 URLs

   d) The lists don't overlap but complement each other.

   e) Additional Sitemaps: an index of hub pages.

Page 13

Case Study: Pubmedcentral.nih.gov

1. Archival domain:
   a) Pages are added and hardly change.
   b) Oldest journal published in 1809.

2. Thus, Sitemaps can be exhaustive.
3. Sitemap files:

   a) 50+ sitemaps files.
   b) 30,000 URLs in each.
   c) Last modification dates inaccurate (unlike CNN and Amazon).

Page 14

Pubmedcentral.nih.gov (cont.)

1. URL breakdown:
   a) Discovery and Sitemaps: 3 million
   b) Sitemaps only: 1.7 million
   c) 1 million due to duplicates

2. Manually examined 3000 sample URLs from the missing ~300,000

   a) 8% errors
   b) 10% redirects
   c) 11% other duplicate content
   d) 51% judgment call needed (should crawl or not)

Page 15

Pubmedcentral

[figure omitted]

Page 16

CNN: New URLs Seen Over Time

[figure omitted]

Page 17

Evaluating Sitemaps

Page 18

Evaluating Sitemaps

1. Coverage and Freshness
2. How should we judge usefulness?
3. How far does a URL get in our pipeline:

   1. Seen
   2. Crawled
   3. Unique
   4. Indexed
   5. Results
   6. Clicked

4. UniqueCoverage = UniqueSitemaps(D) / Unique(D)
5. IndexCoverage = IndexedSitemaps(D) / Indexed(D)
6. PageRankCoverage = RankMassSitemaps(D) / RankMass(D)
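A minimal Python sketch of how these three coverage metrics could be computed for a single domain D, assuming we already know, per URL, whether it came from Sitemaps, whether it was indexed, and a PageRank-style score; the data structure and field names are illustrative, not the actual pipeline.

from dataclasses import dataclass

@dataclass
class UrlInfo:
    url: str
    from_sitemaps: bool   # URL appeared in the domain's Sitemaps feed
    indexed: bool         # URL made it into the index
    rank_mass: float      # PageRank-style score of the URL

def coverage_metrics(domain_urls):
    # `domain_urls` is the domain's set of unique (deduplicated) URLs.
    unique = list(domain_urls)
    unique_sm = [u for u in unique if u.from_sitemaps]
    indexed = [u for u in unique if u.indexed]
    indexed_sm = [u for u in indexed if u.from_sitemaps]
    rank_mass = sum(u.rank_mass for u in unique)
    rank_mass_sm = sum(u.rank_mass for u in unique_sm)
    return {
        "UniqueCoverage": len(unique_sm) / len(unique) if unique else 0.0,
        "IndexCoverage": len(indexed_sm) / len(indexed) if indexed else 0.0,
        "PageRankCoverage": rank_mass_sm / rank_mass if rank_mass else 0.0,
    }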

Page 19

Coverage

Page 20

Coverage vs UniqueCoverage

[figure omitted]

Page 21

UniqueCoverage vs Domain Size

• 46% of domains have above 50% UniqueCoverage.
• 12% of domains have 90% UniqueCoverage.

Page 22

While PageRank Coverage…

[figure omitted]

Page 23

Bang for the Buck…

[figure omitted]

Page 24

Pings and Freshness

First seen by Sitemaps:
• Ping: 12.7%
• Non-ping: 80.3%

First seen by Discovery:
• Ping: 1.5%
• Non-ping: 5.5%

• 14.2% discovered through pings.
• But which source saw a URL first is independent of pinging.
• Doesn't reflect the potential.
• Research opportunity: detect-and-ping policy.
• Of URLs seen by both Sitemaps and Discovery:
  o 78% seen first by Sitemaps
  o 22% seen first by Discovery

Page 25

Doing Both: Sitemaps and Discovery

1. New URLs and refresh: we'll talk about new URLs.
2. You can't fetch it all: there is a per-site quota.
3. What to fetch?
4. The crawl uses some ranking.
5. What should the ranking for Sitemaps URLs be?
6. How to balance between them?

Page 26

Ranking URLs in Sitemaps

1. Priority:
   1. Full authority to the webmaster.
   2. Not always available.

2. PageRank:
   1. Proven effective.
   2. Not available for truly new pages.
   3. Webmasters don't have a say at all.

3. PriorityRank (sketched below):
   1. Modify the graph to take both into account.
   2. Add the Sitemaps file as a page implicitly linked to from the root.
   3. Links from Sitemaps are weighted by priority, if available.
   4. Calculate PageRank over this modified graph.
   5. A hybrid of the two previous methods.
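A minimal Python sketch of the PriorityRank idea, under assumptions the slide does not spell out (damping factor, a default priority of 0.5 when none is given, uniform handling of dangling nodes); it is meant to show the graph modification, not the paper's exact formulation.

def priority_rank(links, root, sitemap_urls, damping=0.85, iters=50):
    # links        : dict url -> list of outgoing urls (the crawled graph)
    # root         : the site's root URL
    # sitemap_urls : dict url -> priority (or None) from the Sitemaps file
    SITEMAP = "__sitemap__"                       # virtual Sitemaps node
    graph = {u: [(v, 1.0) for v in outs] for u, outs in links.items()}
    graph.setdefault(root, []).append((SITEMAP, 1.0))   # root -> sitemap
    graph[SITEMAP] = [(u, p if p is not None else 0.5)  # sitemap -> URLs,
                      for u, p in sitemap_urls.items()] # weighted by priority

    nodes = set(graph) | {v for outs in graph.values() for v, _ in outs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for u in nodes:
            outs = graph.get(u, [])
            total_w = sum(w for _, w in outs)
            if total_w == 0:                      # dangling node: spread evenly
                for n in nodes:
                    new_rank[n] += damping * rank[u] / len(nodes)
            else:
                for v, w in outs:
                    new_rank[v] += damping * rank[u] * (w / total_w)
        rank = new_rank
    return rank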

Page 27

Balancing the Crawl: Algorithm Simplified

1. kD = kS = 1/2 initially
2. for epoch in 0..infinity do
   1. Fetch:
      • Top kD * Quota from Discovery
      • Top kS * Quota from Sitemaps
   2. Measure the derivative of the utility (IndexCoverage)
   3. Adjust kD and kS
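A rough Python sketch of the balancing loop above, under assumptions about the interfaces: fetch() crawls a list of URLs, index_coverage() returns the current utility, the queues are already ranked, and the split is nudged toward whichever source produced the larger marginal gain in the epoch; the step size and bounds are illustrative.

def balance_crawl(discovery_queue, sitemaps_queue, fetch, index_coverage,
                  quota, epochs=10, step=0.05):
    k_d = k_s = 0.5                              # initial 50/50 split of the quota
    for _ in range(epochs):
        before = index_coverage()
        fetch(discovery_queue[:int(k_d * quota)])   # top-ranked Discovery URLs
        gain_d = index_coverage() - before

        mid = index_coverage()
        fetch(sitemaps_queue[:int(k_s * quota)])    # top-ranked Sitemaps URLs
        gain_s = index_coverage() - mid

        # "Derivative of the utility": shift the split toward the source
        # whose fetches improved IndexCoverage more in this epoch.
        if gain_d > gain_s:
            k_d, k_s = min(k_d + step, 0.9), max(k_s - step, 0.1)
        elif gain_s > gain_d:
            k_d, k_s = max(k_d - step, 0.1), min(k_s + step, 0.9)
    return k_d, k_s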

Page 28

Conclusion and Future Work

1. Large-scale study, real data.
2. You cannot stop Discovery… yet.
3. Presented metrics for freshness and coverage.
4. Sitemaps evaluated for coverage and freshness.
5. Presented an algorithm to combine Sitemaps & Discovery.
6. To be done:

   1. Good news: tons of future work.
   2. Duplicates not solved on the web-server side either.
   3. Better pings.
   4. Ranking Sitemaps URLs can be a challenge.

Page 29

Acks

We wish to thank many Googlers: Dennis Geels, Ori Gershony, Laramie, Madhu, Thomal, Alkis, Peter Dickman, Arup, Charlie, Nish, Rosemary, Ralph, Nikhil.

Page 30

The End

Thank You!
