Efficient Crawling Through URL Ordering

Junghoo Cho, Hector Garcia-Molina, Lawrence Page
Stanford InfoLab

Page 1: Efficient Crawling Through URL Ordering


Efficient Crawling Through URL Ordering

Junghoo Cho

Hector Garcia-Molina

Lawrence Page

Stanford InfoLab

Page 2: Efficient Crawling Through URL Ordering

What is a crawler?

A program that automatically retrieves pages from the Web.

Widely used by search engines.

Page 3: Efficient Crawling Through URL Ordering

Challenges

- There are many pages on the Web (major search engines have indexed more than 100M pages).
- The size of the Web is growing enormously, and most pages are not very interesting.
- In most cases it is too costly, or not worthwhile, to visit the entire Web space.

Page 4: Efficient Crawling Through URL Ordering

Good crawling strategy

Make the crawler visit "important pages" first:

- Save network bandwidth
- Save storage space and management cost
- Serve quality pages to the client application

Page 5: Efficient Crawling Through URL Ordering

Outline

- Importance metrics: what are important pages?
- Crawling models: how is a crawler evaluated?
- Experiments
- Conclusion & future work

Page 6: Efficient Crawling Through URL Ordering

Importance metric

The metric for determining whether a page is HOT:

- Similarity to a driving query
- Location metric
- Backlink count
- PageRank

Page 7: Efficient Crawling Through URL Ordering

Similarity to a driving query

Importance is measured by the closeness of the page to the topic (e.g. the number of occurrences of the topic word in the page).

Useful for a personalized crawler: given a topic such as "Sports" or "Bill Clinton", gather the pages related to that specific topic.
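The word-count notion of similarity above can be sketched in a few lines. This is an illustrative simplification (a real crawler would normalize by page length or use a full IR similarity measure), not the paper's implementation:

```python
import re

def similarity(page_text, query_word):
    """Count occurrences of the query word in the page text,
    as a crude closeness-to-topic score (simplified sketch)."""
    words = re.findall(r"\w+", page_text.lower())
    return words.count(query_word.lower())
```

For example, `similarity("Sports news: sports scores and more sports", "Sports")` counts three occurrences.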

Page 8: Efficient Crawling Through URL Ordering

Importance metric

The metric for determining whether a page is HOT:

- Similarity to a driving query
- Location metric
- Backlink count
- PageRank

Page 9: Efficient Crawling Through URL Ordering

Backlink-based metric

- Backlink count: the number of pages pointing to the page (a citation metric)
- PageRank: a weighted backlink count, where the weights are defined iteratively

Page 10: Efficient Crawling Through URL Ordering

[Figure: example link graph over pages A–F; E and C both point to F, and E has one other out-link]

BackLinkCount(F) = 2

PageRank(F) = PageRank(E)/2 + PageRank(C)
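The iterative definition of PageRank can be sketched as follows. This is a simplified illustration on a hypothetical six-page graph, with a standard damping factor d added so the iteration converges; the slide's formula corresponds to the undamped (d = 1) case:

```python
def pagerank(links, d=0.85, iterations=100):
    """Iteratively compute PageRank scores.

    links maps each page to the list of pages it points to; each
    page divides its score evenly among its out-links.  The damping
    factor d makes the iteration converge; the slide's formula
    PageRank(F) = PageRank(E)/2 + PageRank(C) is the d = 1 case.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += d * rank[p] / len(outs)
        rank = new
    return rank

# Hypothetical six-page graph: E and C both point to F,
# and E has one other out-link, as in the slide's example.
graph = {'A': ['B'], 'B': ['C'], 'C': ['F'],
         'D': ['E'], 'E': ['F', 'A'], 'F': ['A']}
ranks = pagerank(graph)
```

At convergence, F's rank satisfies the damped analogue of the slide's equation: it receives half of E's rank and all of C's rank.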

Page 11: Efficient Crawling Through URL Ordering


Ordering metric

The metric for a crawler to “estimate” the importance of a page

The ordering metric can be different from the importance metric
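A crawler driven by an ordering metric can be sketched as a priority-queue loop; `fetch` and `extract_urls` here are hypothetical stand-ins supplied by the caller, not WebBase components:

```python
import heapq

def crawl(seed_urls, ordering_metric, fetch, extract_urls, max_pages):
    """Visit URLs in decreasing order of an ordering metric.

    ordering_metric(url) estimates a page's importance *before*
    the page is downloaded; fetch and extract_urls are hypothetical
    stand-ins for real page retrieval and link extraction.
    """
    # heapq is a min-heap, so scores are negated for max-first order.
    frontier = [(-ordering_metric(u), u) for u in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        visited.append(url)
        for link in extract_urls(page):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-ordering_metric(link), link))
    return visited
```

Because the ordering metric only estimates importance from information available before the download, it can differ from the importance metric used to judge the crawl afterwards.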

Page 12: Efficient Crawling Through URL Ordering

Crawling models

- Crawl and stop: keep crawling until the local disk space is full.
- Limited buffer crawl: keep crawling until the whole Web space is visited, throwing out seemingly unimportant pages.

Page 13: Efficient Crawling Through URL Ordering

Crawl and stop model

[Figure: % crawled HOT pages vs. % crawled pages (e.g. time); curves range from "Perfect" down through "Good" and "Poor" to "Random"]
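The y-axis of these plots — the fraction of HOT pages retrieved after each crawled page — can be computed with a small helper (names are hypothetical, a sketch of the evaluation rather than the paper's code):

```python
def hot_page_curve(crawl_order, hot_pages):
    """For each prefix of the crawl order, return the fraction of
    HOT pages crawled so far (the y-axis of the evaluation plot)."""
    hot = set(hot_pages)
    found = 0
    curve = []
    for url in crawl_order:
        if url in hot:
            found += 1
        curve.append(found / len(hot))
    return curve
```

A perfect crawler's curve reaches 1.0 as early as possible; a random crawler's curve is roughly the diagonal.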

Page 14: Efficient Crawling Through URL Ordering

Crawling models

- Crawl and stop: keep crawling until the local disk space is full.
- Limited buffer crawl: keep crawling until the whole Web space is visited, throwing out seemingly unimportant pages.

Page 15: Efficient Crawling Through URL Ordering

Limited buffer model

[Figure: % crawled HOT pages vs. % crawled pages (e.g. time); curves range from "Perfect" through "Good" to "Poor"]

Page 16: Efficient Crawling Through URL Ordering

Architecture

[Figure: the WebBase crawler fetches pages from the Stanford WWW into a repository. The virtual crawler reads crawled pages from the repository; its HTML parser extracts URLs and page info, which feed a URL pool; the URL selector picks the next selected URL for the virtual crawler.]

Page 17: Efficient Crawling Through URL Ordering

Experiments

- Backlink-based importance metrics: backlink count, PageRank
- Similarity-based importance metric: similarity to a query word

Page 18: Efficient Crawling Through URL Ordering

Ordering metrics in experiments

- Breadth-first order
- Backlink count
- PageRank

Page 19: Efficient Crawling Through URL Ordering

[Figure: % crawled HOT pages vs. % crawled pages (e.g. time). Importance metric: backlink count. Ordering metrics compared: PageRank, backlink, breadth, random]

Page 20: Efficient Crawling Through URL Ordering

Similarity-based crawling

The content of a page is not available before the page is visited.

Essentially, the crawler must "guess" the content of the page.

This makes similarity-based crawling more difficult than backlink-based crawling.

Page 21: Efficient Crawling Through URL Ordering

Promising page

[Figure: three clues that an unvisited page is about "Sports" — the anchor text of the link pointing to it ("Sports!! Sports!!"), a HOT parent page pointing to it, and its URL (…/sports.html)]

Page 22: Efficient Crawling Through URL Ordering

Virtual crawler for similarity-based crawling

A page is "promising" if:
- The query word appears in its anchor text
- The query word appears in its URL
- The page pointing to it is an "important" page

Visit "promising" pages first; visit "non-promising" pages in ordering-metric order.
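The promising-page rule can be sketched as a predicate over candidate links; the `Link` record and its field names are assumptions made for illustration, not the paper's data structures:

```python
from dataclasses import dataclass

@dataclass
class Link:
    url: str
    anchor_text: str
    parent_is_hot: bool  # the page pointing to it is "important"

def is_promising(link, query_word):
    """A link is promising if the query word appears in its anchor
    text or URL, or if the page pointing to it is important."""
    q = query_word.lower()
    return (q in link.anchor_text.lower()
            or q in link.url.lower()
            or link.parent_is_hot)

def order_frontier(links, query_word, ordering_metric):
    """Promising links first, the rest in ordering-metric order."""
    promising = [l for l in links if is_promising(l, query_word)]
    rest = [l for l in links if not is_promising(l, query_word)]
    rest.sort(key=ordering_metric, reverse=True)
    return promising + rest
```

The anchor text, URL, and parent page are exactly the pieces of information a crawler has before downloading the page, which is why the heuristic can run at ordering time.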

Page 23: Efficient Crawling Through URL Ordering

[Figure: % crawled HOT pages vs. % crawled pages (e.g. time). Topic word: "admission". Ordering metrics compared (modified ordering metrics): PageRank, backlink, breadth, random]

Page 24: Efficient Crawling Through URL Ordering


Conclusion

PageRank is generally good as an ordering metric.

By applying a good ordering metric, it is possible to gather important pages quickly.

Page 25: Efficient Crawling Through URL Ordering

Future work

- Limited buffer crawling model
- Replicated page detection
- Consistency maintenance

Page 26: Efficient Crawling Through URL Ordering


Problem

In what order should a crawler visit web pages to get the pages we want?

How can we get important pages first?

Page 27: Efficient Crawling Through URL Ordering

WebBase

- A system for creating and maintaining a large local repository of Web pages
- High indexing speed (50 pages/sec) and large repository (150GB)
- A load-balancing scheme to prevent servers from crashing

Page 28: Efficient Crawling Through URL Ordering

Virtual Web crawler

- The crawler used for the experiments
- Runs on top of the WebBase repository
- No load balancing
- Dataset restricted to the Stanford domain

Page 29: Efficient Crawling Through URL Ordering

Available Information

- Anchor text
- URL of the page
- The content of the page pointing to it