1
Efficient Crawling Through URL Ordering
Junghoo Cho
Hector Garcia-Molina
Lawrence Page
Stanford InfoLab
2
What is a crawler?
Program that automatically retrieves pages from the Web.
Widely used by search engines.
3
Challenges
There are many pages on the Web.
(Major search engines have indexed more than 100M pages.)
The Web is growing enormously, and most of its pages are not very interesting.
In most cases, it is too costly or not worthwhile to visit the entire Web space.
4
Good crawling strategy
Make the crawler visit “important pages” first.
Save network bandwidth
Save storage space and management cost
Serve quality pages to the client application
5
Outline
Importance metrics: what are important pages?
Crawling models: how is a crawler evaluated?
Experiments
Conclusion & future work
6
Importance metric
The metric for determining if a page is HOT
Similarity to a driving query
Location metric
Backlink count
PageRank
7
Similarity to a driving query
Importance is measured by the closeness of the page to the topic (e.g. the number of occurrences of the topic word in the page).
Personalized crawler
Example: for a query such as “Sports” or “Bill Clinton”, the pages related to that specific topic are the important ones.
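A minimal sketch of such a similarity metric, assuming the simplest variant suggested on the slide (count how often the topic word occurs in the page text); the `similarity` helper and the example sentence are illustrative, not code from the paper:

```python
import re

def similarity(page_text: str, query_word: str) -> int:
    # Count case-insensitive occurrences of the driving query word in the page.
    words = re.findall(r"[a-z]+", page_text.lower())
    return words.count(query_word.lower())

# A page could be considered HOT for the query "sports" if the count
# exceeds some chosen threshold.
print(similarity("Sports news: college sports scores and more.", "sports"))  # 2
```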
8
Importance metric
The metric for determining if a page is HOT
Similarity to a driving query
Location metric
Backlink count
PageRank
9
Backlink-based metric
Backlink count: the number of pages pointing to the page (a citation metric)
PageRank: a weighted backlink count, where the weight is iteratively defined
10
[Figure: example link graph over pages A-F in which C and E both point to F]
BackLinkCount(F) = 2
PageRank(F) = PageRank(E)/2 + PageRank(C)
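A minimal sketch of the two backlink-based metrics. The slide's simplified formula omits PageRank's damping factor; a standard damping factor d = 0.85 is added here (an assumption) so the iteration converges, and the six-page graph is hypothetical, chosen only to be consistent with the slide (C and E link to F, and E has two outgoing links):

```python
def backlink_count(graph):
    # graph maps each page to the list of pages it links to.
    counts = {p: 0 for p in graph}
    for p, outlinks in graph.items():
        for q in outlinks:
            counts[q] += 1
    return counts

def pagerank(graph, d=0.85, iterations=50):
    # Weighted backlink count, iteratively defined:
    # rank(p) = (1 - d)/N + d * sum over q linking to p of rank(q)/outdegree(q)
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in graph}
        for p, outlinks in graph.items():
            for q in outlinks:
                new[q] += d * rank[p] / len(outlinks)
        rank = new
    return rank

graph = {"A": ["C"], "B": ["C"], "C": ["F"],
         "D": ["E"], "E": ["F", "D"], "F": []}
print(backlink_count(graph)["F"])  # 2
print(pagerank(graph)["F"])        # F receives rank from C and from half of E's rank
```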
11
Ordering metric
The metric for a crawler to “estimate” the importance of a page
The ordering metric can be different from the importance metric
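To make this concrete, here is a minimal sketch (assumed, not the authors' crawler) in which the ordering metric is the backlink count among pages seen so far, and the crawler always visits the frontier URL with the highest estimate. `fetch_links` is a hypothetical helper that downloads a page and returns the URLs it links to:

```python
import heapq

def crawl(seed_urls, fetch_links, max_pages=1000):
    estimate = {u: 0 for u in seed_urls}    # ordering metric: backlinks seen so far
    frontier = [(0, u) for u in seed_urls]  # min-heap of (-estimate, url)
    heapq.heapify(frontier)
    visited, order = set(), []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:                  # skip stale heap entries
            continue
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):
            estimate[link] = estimate.get(link, 0) + 1
            if link not in visited:         # re-push with the updated estimate
                heapq.heappush(frontier, (-estimate[link], link))
    return order
```

The importance metric, by contrast, may only be computable once the whole Web space has been seen, which is why the crawler has to work from an estimate.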
12
Crawling models
Crawl and stop: keep crawling until the local disk space is full.
Limited buffer crawl: keep crawling until the whole Web space is visited, throwing out seemingly unimportant pages.
Crawl and stop model
[Chart: % crawled HOT pages vs. % crawled pages (e.g. time) under the crawl and stop model, with curves for a perfect crawler, a good crawler, a poor crawler, and a random crawler.]
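A minimal sketch (assumed) of how such a curve is computed for a given crawl: walk the crawler's visit order and record what fraction of the HOT pages (as defined by the importance metric) has been found after each fraction of the crawl:

```python
def hot_curve(visit_order, hot_pages):
    found, points = 0, []
    for i, page in enumerate(visit_order, start=1):
        if page in hot_pages:
            found += 1
        points.append((i / len(visit_order), found / len(hot_pages)))
    return points  # a perfect crawler reaches 100% after |hot_pages| visits
```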
14
Crawling models
Crawl and stop: keep crawling until the local disk space is full.
Limited buffer crawl: keep crawling until the whole Web space is visited, throwing out seemingly unimportant pages.
Limited buffer model
[Chart: % crawled HOT pages vs. % crawled pages (e.g. time) under the limited buffer model, with curves for a perfect crawler, a good crawler, and a poor crawler.]
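For contrast with crawl and stop, a minimal sketch (assumed) of the limited buffer model: the crawler keeps visiting pages, but only the B pages that currently look most important are retained; seemingly unimportant pages are thrown out once the buffer is full:

```python
import heapq

def limited_buffer_crawl(visit_order, estimate_importance, buffer_size):
    buffer = []  # min-heap of (estimated importance, page); least important evicted
    for page in visit_order:
        heapq.heappush(buffer, (estimate_importance(page), page))
        if len(buffer) > buffer_size:
            heapq.heappop(buffer)  # throw out the seemingly least important page
    return [page for _, page in buffer]
```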
16
Architecture
[Diagram: experimental architecture. The WebBase crawler fetches pages from the Stanford WWW into a repository. The virtual crawler reads crawled pages from the repository; an HTML parser produces extracted URLs and page info, which feed a URL pool and a page-info store; a URL selector hands the selected URL back to the virtual crawler.]
17
Experiments
Backlink-based importance metrics: backlink count, PageRank
Similarity-based importance metric: similarity to a query word
18
Ordering metrics in experiments
Breadth-first order
Backlink count
PageRank
[Chart: % crawled HOT pages vs. % crawled pages (e.g. time). Importance metric: backlink count. Ordering metrics compared: PageRank, backlink count, breadth-first, random.]
20
Similarity-based crawling
The content of the page is not available before it is visited
Essentially, the crawler should “guess” the content of the page
More difficult than backlink-based crawling
21
Promising page
[Diagram: clues that an unvisited page may be about “Sports”: the anchor text of the link to it (“Sports!!”), the fact that a HOT parent page points to it, and its URL (…/sports.html).]
22
Virtual crawler for similarity-based crawling
A page is promising if:
the query word appears in its anchor text,
the query word appears in its URL, or
the page pointing to it is an “important” page.
Visit “promising pages” first; visit “non-promising pages” in ordering-metric order (sketched below).
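A minimal sketch (assumed) of this modified ordering: promising URLs, identified from the anchor text, the URL string, or a HOT parent, are preferred; otherwise the crawler falls back to the ordering metric. `ordering_metric` is a hypothetical scoring function supplied by the caller:

```python
def is_promising(url, anchor_text, parent_is_hot, query_word):
    q = query_word.lower()
    return q in anchor_text.lower() or q in url.lower() or parent_is_hot

def next_url(frontier, ordering_metric, query_word):
    # frontier: list of (url, anchor_text, parent_is_hot) tuples
    promising = [f for f in frontier
                 if is_promising(f[0], f[1], f[2], query_word)]
    pool = promising if promising else frontier
    return max(pool, key=lambda f: ordering_metric(f[0]))[0]
```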
[Chart: % crawled HOT pages vs. % crawled pages (e.g. time). Topic word: "admission". Modified ordering metrics compared: PageRank, backlink count, breadth-first, random.]
24
Conclusion
PageRank is generally good as an ordering metric.
By applying a good ordering metric, it is possible to gather important pages quickly.
25
Future work
Limited buffer crawling model
Replicated page detection
Consistency maintenance
26
Problem
In what order should a crawler visit web pages to get the pages we want?
How can we get important pages first?
27
WebBase
System for creating and maintaining a large local repository
High index speed (50 pages/sec) and large repository (150GB)
Load balancing scheme to prevent servers from crashing
28
Virtual Web crawler
The crawler used for the experiments
Runs on top of the WebBase repository
No load balancing
Dataset restricted to the Stanford domain
29
Available Information
Anchor text
URL of the page
The content of the page pointing to it