Efficient Crawling Through URL Ordering

Junghoo Cho, Hector Garcia-Molina, Lawrence Page
Stanford InfoLab

Page 1: Efficient Crawling Through URL Ordering


Efficient Crawling Through URL Ordering

Junghoo Cho

Hector Garcia-Molina

Lawrence Page

Stanford InfoLab

Page 2: Efficient Crawling Through URL Ordering

What is a crawler?

A program that automatically retrieves pages from the Web.

Widely used by search engines.

Page 3: Efficient Crawling Through URL Ordering

Challenges

- There are many pages on the Web (major search engines have indexed more than 100M pages).
- The size of the Web is growing enormously, and most pages are not very interesting.
- In most cases it is too costly, or not worthwhile, to visit the entire Web space.

Page 4: Efficient Crawling Through URL Ordering

Good crawling strategy

Make the crawler visit "important pages" first:

- Save network bandwidth
- Save storage space and management cost
- Serve quality pages to the client application

Page 5: Efficient Crawling Through URL Ordering

Outline

- Importance metrics: what are important pages?
- Crawling models: how is a crawler evaluated?
- Experiments
- Conclusion & future work

Page 6: Efficient Crawling Through URL Ordering

Importance metric

The metric for determining whether a page is HOT:

- Similarity to a driving query
- Location metric
- Backlink count
- PageRank

Page 7: Efficient Crawling Through URL Ordering

Similarity to a driving query

Importance is measured by the closeness of the page to the topic (e.g. the number of occurrences of the topic word in the page).

Useful for a personalized crawler: given a topic such as "Sports" or "Bill Clinton", gather the pages related to that specific topic.
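The word-count notion of similarity above can be sketched in a few lines. This is an illustrative simplification (a real crawler would normalize by page length or use a full IR similarity measure), not the paper's implementation:

```python
import re

def similarity(page_text, query_word):
    """Count occurrences of the query word in the page text,
    as a crude closeness-to-topic score (simplified sketch)."""
    words = re.findall(r"\w+", page_text.lower())
    return words.count(query_word.lower())
```

For example, `similarity("Sports news: sports scores and more sports", "Sports")` counts three occurrences.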

Page 8: Efficient Crawling Through URL Ordering

Importance metric

The metric for determining whether a page is HOT:

- Similarity to a driving query
- Location metric
- Backlink count
- PageRank

Page 9: Efficient Crawling Through URL Ordering

Backlink-based metric

- Backlink count: the number of pages pointing to the page (a citation metric)
- PageRank: a weighted backlink count, where the weights are defined iteratively

Page 10: Efficient Crawling Through URL Ordering

[Figure: example link graph over pages A–F; E and C both point to F, and E has one other out-link]

BackLinkCount(F) = 2

PageRank(F) = PageRank(E)/2 + PageRank(C)
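The iterative definition of PageRank can be sketched as follows. This is a simplified illustration on a hypothetical six-page graph, with a standard damping factor d added so the iteration converges; the slide's formula corresponds to the undamped (d = 1) case:

```python
def pagerank(links, d=0.85, iterations=100):
    """Iteratively compute PageRank scores.

    links maps each page to the list of pages it points to; each
    page divides its score evenly among its out-links.  The damping
    factor d makes the iteration converge; the slide's formula
    PageRank(F) = PageRank(E)/2 + PageRank(C) is the d = 1 case.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += d * rank[p] / len(outs)
        rank = new
    return rank

# Hypothetical six-page graph: E and C both point to F,
# and E has one other out-link, as in the slide's example.
graph = {'A': ['B'], 'B': ['C'], 'C': ['F'],
         'D': ['E'], 'E': ['F', 'A'], 'F': ['A']}
ranks = pagerank(graph)
```

At convergence, F's rank satisfies the damped analogue of the slide's equation: it receives half of E's rank and all of C's rank.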

Page 11: Efficient Crawling Through URL Ordering


Ordering metric

The metric for a crawler to “estimate” the importance of a page

The ordering metric can be different from the importance metric
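A crawler driven by an ordering metric can be sketched as a priority-queue loop; `fetch` and `extract_urls` here are hypothetical stand-ins supplied by the caller, not WebBase components:

```python
import heapq

def crawl(seed_urls, ordering_metric, fetch, extract_urls, max_pages):
    """Visit URLs in decreasing order of an ordering metric.

    ordering_metric(url) estimates a page's importance *before*
    the page is downloaded; fetch and extract_urls are hypothetical
    stand-ins for real page retrieval and link extraction.
    """
    # heapq is a min-heap, so scores are negated for max-first order.
    frontier = [(-ordering_metric(u), u) for u in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        visited.append(url)
        for link in extract_urls(page):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-ordering_metric(link), link))
    return visited
```

Because the ordering metric only estimates importance from information available before the download, it can differ from the importance metric used to judge the crawl afterwards.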

Page 12: Efficient Crawling Through URL Ordering

Crawling models

- Crawl and stop: keep crawling until the local disk space is full.
- Limited buffer crawl: keep crawling until the whole Web space is visited, throwing out seemingly unimportant pages.

Page 13: Efficient Crawling Through URL Ordering

Crawl and stop model

[Figure: % crawled HOT pages vs. % crawled pages (e.g. time); curves range from "Perfect" down through "Good" and "Poor" to "Random"]
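The y-axis of these plots — the fraction of HOT pages retrieved after each crawled page — can be computed with a small helper (names are hypothetical, a sketch of the evaluation rather than the paper's code):

```python
def hot_page_curve(crawl_order, hot_pages):
    """For each prefix of the crawl order, return the fraction of
    HOT pages crawled so far (the y-axis of the evaluation plot)."""
    hot = set(hot_pages)
    found = 0
    curve = []
    for url in crawl_order:
        if url in hot:
            found += 1
        curve.append(found / len(hot))
    return curve
```

A perfect crawler's curve reaches 1.0 as early as possible; a random crawler's curve is roughly the diagonal.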

Page 14: Efficient Crawling Through URL Ordering

Crawling models

- Crawl and stop: keep crawling until the local disk space is full.
- Limited buffer crawl: keep crawling until the whole Web space is visited, throwing out seemingly unimportant pages.

Page 15: Efficient Crawling Through URL Ordering

Limited buffer model

[Figure: % crawled HOT pages vs. % crawled pages (e.g. time); curves range from "Perfect" through "Good" to "Poor"]

Page 16: Efficient Crawling Through URL Ordering

Architecture

[Figure: the WebBase crawler fetches pages from the Stanford WWW into a repository. The virtual crawler reads crawled pages from the repository; its HTML parser extracts URLs and page info, which feed a URL pool; the URL selector picks the next selected URL for the virtual crawler.]

Page 17: Efficient Crawling Through URL Ordering

Experiments

- Backlink-based importance metrics: backlink count, PageRank
- Similarity-based importance metric: similarity to a query word

Page 18: Efficient Crawling Through URL Ordering

Ordering metrics in experiments

- Breadth-first order
- Backlink count
- PageRank

Page 19: Efficient Crawling Through URL Ordering

[Figure: % crawled HOT pages vs. % crawled pages (e.g. time). Importance metric: backlink count. Ordering metrics compared: PageRank, backlink, breadth, random]

Page 20: Efficient Crawling Through URL Ordering

Similarity-based crawling

The content of a page is not available before the page is visited.

Essentially, the crawler must "guess" the content of the page.

This makes similarity-based crawling more difficult than backlink-based crawling.

Page 21: Efficient Crawling Through URL Ordering

Promising page

[Figure: three clues that an unvisited page is about "Sports" — the anchor text of the link pointing to it ("Sports!! Sports!!"), a HOT parent page pointing to it, and its URL (…/sports.html)]

Page 22: Efficient Crawling Through URL Ordering

Virtual crawler for similarity-based crawling

A page is "promising" if:
- The query word appears in its anchor text
- The query word appears in its URL
- The page pointing to it is an "important" page

Visit "promising" pages first; visit "non-promising" pages in ordering-metric order.
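The promising-page rule can be sketched as a predicate over candidate links; the `Link` record and its field names are assumptions made for illustration, not the paper's data structures:

```python
from dataclasses import dataclass

@dataclass
class Link:
    url: str
    anchor_text: str
    parent_is_hot: bool  # the page pointing to it is "important"

def is_promising(link, query_word):
    """A link is promising if the query word appears in its anchor
    text or URL, or if the page pointing to it is important."""
    q = query_word.lower()
    return (q in link.anchor_text.lower()
            or q in link.url.lower()
            or link.parent_is_hot)

def order_frontier(links, query_word, ordering_metric):
    """Promising links first, the rest in ordering-metric order."""
    promising = [l for l in links if is_promising(l, query_word)]
    rest = [l for l in links if not is_promising(l, query_word)]
    rest.sort(key=ordering_metric, reverse=True)
    return promising + rest
```

The anchor text, URL, and parent page are exactly the pieces of information a crawler has before downloading the page, which is why the heuristic can run at ordering time.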

Page 23: Efficient Crawling Through URL Ordering

[Figure: % crawled HOT pages vs. % crawled pages (e.g. time). Topic word: "admission". Ordering metrics compared (modified ordering metrics): PageRank, backlink, breadth, random]

Page 24: Efficient Crawling Through URL Ordering


Conclusion

PageRank is generally good as an ordering metric.

By applying a good ordering metric, it is possible to gather important pages quickly.

Page 25: Efficient Crawling Through URL Ordering

Future work

- Limited buffer crawling model
- Replicated page detection
- Consistency maintenance

Page 26: Efficient Crawling Through URL Ordering


Problem

In what order should a crawler visit web pages to get the pages we want?

How can we get important pages first?

Page 27: Efficient Crawling Through URL Ordering

WebBase

- A system for creating and maintaining a large local repository of Web pages
- High indexing speed (50 pages/sec) and large repository (150GB)
- A load-balancing scheme to prevent servers from crashing

Page 28: Efficient Crawling Through URL Ordering

Virtual Web crawler

- The crawler used for the experiments
- Runs on top of the WebBase repository
- No load balancing
- Dataset restricted to the Stanford domain

Page 29: Efficient Crawling Through URL Ordering

Available Information

- Anchor text
- URL of the page
- The content of the page pointing to it