Efficient Crawling Through URL Ordering


Junghoo Cho

Hector Garcia-Molina

Lawrence Page

Stanford InfoLab

What is a crawler?

Program that automatically retrieves pages from the Web.

Widely used for search engines.

Challenges

There are many pages out on the Web.

(Major search engines indexed more than 100M pages)

The size of the Web is growing enormously, and most pages are not very interesting.

In most cases, it is too costly or not worthwhile to visit the entire Web space.

Good crawling strategy

Make the crawler visit “important pages” first.

Save network bandwidth

Save storage space and management cost

Serve quality pages to the client application

Outline

Importance metrics: what are important pages?

Crawling models: how is a crawler evaluated?

Experiments

Conclusion & future work

Importance metric

The metric for determining if a page is HOT

Similarity to a driving query

Location metric

Backlink count

PageRank

Similarity to a driving query

Importance is measured by the closeness of the page to the topic (e.g. the number of occurrences of the topic word in the page)

Personalized crawler: gathers the pages related to a specific topic (e.g. “Sports”, “Bill Clinton”)
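To make the similarity metric concrete, here is a minimal sketch that scores a page by counting occurrences of the topic word; the tokenization and scoring are illustrative assumptions, not the paper's exact definition.

```python
import re

def similarity(page_text: str, query: str) -> int:
    """Toy similarity-to-driving-query metric: count occurrences of
    the query word in the page text (case-insensitive). An illustrative
    stand-in, not the paper's exact definition."""
    words = re.findall(r"\w+", page_text.lower())
    return words.count(query.lower())

# Example: score a page for the driving query "sports"
print(similarity("Sports news and more sports coverage", "sports"))  # -> 2
```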

Backlink-based metric

Backlink count: the number of pages pointing to the page (a citation metric)

PageRank: a weighted backlink count, where the weight is defined iteratively

[Figure: example link graph over pages A–F]

BackLinkCount(F) = 2

PageRank(F) = PageRank(E)/2 + PageRank(C)
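As an illustration of the iterative definition, here is a minimal power-iteration sketch of PageRank. The damping factor d and the toy graph are assumptions for the example; the slide's formula is the pure link term of this update.

```python
def pagerank(links, iters=50, d=0.85):
    """Iteratively compute PageRank over `links`, which maps each page
    to the pages it points to. Each page divides its rank evenly among
    its out-links, matching PageRank(F) = PageRank(E)/2 + PageRank(C);
    the damping factor d is the usual refinement (an assumption here)."""
    n = len(links)
    rank = {p: 1.0 / n for p in links}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in links}
        for p, outs in links.items():
            for q in outs:
                new[q] += d * rank[p] / len(outs)
        rank = new
    return rank

# Hypothetical toy graph (the slide's full figure is not recoverable)
print(pagerank({"C": ["F"], "E": ["F", "C"], "F": ["E"]}))
```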

Ordering metric

The metric for a crawler to “estimate” the importance of a page

The ordering metric can be different from the importance metric
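In practice, the ordering metric drives which URL the crawler dequeues next. A minimal sketch of such a crawl loop, assuming hypothetical `estimate_importance` and `fetch_and_extract` callbacks:

```python
import heapq

def crawl(seed_urls, estimate_importance, fetch_and_extract, budget=1000):
    """Visit the frontier URL with the best ordering-metric score first.
    Both callbacks are hypothetical: estimate_importance(url) returns an
    ordering-metric score; fetch_and_extract(url) returns extracted links."""
    seen = set(seed_urls)
    # heapq is a min-heap, so negate scores to pop the best URL first
    frontier = [(-estimate_importance(u), u) for u in seed_urls]
    heapq.heapify(frontier)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_and_extract(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-estimate_importance(link), link))
    return visited
```

Note that ordering metrics such as backlink count are computed over the pages seen so far, so scores change as the crawl proceeds; a real implementation would re-prioritize the frontier periodically.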

Crawling models

Crawl and stop: keep crawling until the local disk space is full.

Limited buffer crawl: keep crawling until the whole web space is visited, throwing out seemingly unimportant pages.

Crawl and stop model

[Figure: crawl-and-stop performance. X-axis: % crawled pages (e.g. time); Y-axis: % crawled HOT pages. Curves labeled Perfect, Good, Random, and Poor.]
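Under the crawl-and-stop model, a crawler's performance is essentially this curve: the fraction of HOT pages obtained after each visit. A minimal sketch, assuming `visited` is the crawl order and `hot` the set of HOT pages:

```python
def crawl_and_stop_curve(visited, hot):
    """Fraction of HOT pages obtained after each crawled page.
    A perfect crawler reaches 1.0 after len(hot) visits; a random
    crawler climbs roughly linearly."""
    found, curve = 0, []
    for page in visited:
        if page in hot:
            found += 1
        curve.append(found / len(hot))
    return curve

# Example: pages b and d are HOT
print(crawl_and_stop_curve(["a", "b", "c", "d"], hot={"b", "d"}))
# -> [0.0, 0.5, 0.5, 1.0]
```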

Limited buffer model

[Figure: limited-buffer performance. X-axis: % crawled pages (e.g. time); Y-axis: % crawled HOT pages. Curves labeled Perfect, Good, and Poor.]

Architecture

[Diagram: the WebBase crawler fetches pages from the Stanford WWW into the repository. The virtual crawler reads crawled pages, its HTML parser extracts URLs and page info into the URL pool, and the URL selector feeds the selected URL back to the virtual crawler.]

Experiments

Backlink-based importance metric backlink count PageRank

Similiarty-based importance metric similarity to a query word

Ordering metrics in experiments

Breadth-first order

Backlink count

PageRank

[Figure: crawl performance for importance metric = backlink count. X-axis: % crawled pages (e.g. time); Y-axis: % crawled HOT pages. One curve per ordering metric: PageRank, backlink, breadth, random.]

Similarity-based crawling

The content of the page is not available before it is visited

Essentially, the crawler should “guess” the content of the page

More difficult than backlink-based crawling

Promising page

[Diagram: hints that an unvisited page is about “Sports”: the anchor text of the link contains the query word (“Sports!!”), the parent page linking to it is HOT, or its URL contains the query word (…/sports.html).]

Virtual crawler for similarity-based crawling

Promising page: the query word appears in its anchor text, the query word appears in its URL, or the page pointing to it is an “important” page

Visit “promising pages” first; visit “non-promising pages” in ordering-metric order
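A minimal sketch of this modified ordering as a frontier sort key; the candidate record, query, and HOT-parent set are illustrative assumptions:

```python
def modified_order_key(candidate, query, hot_parents):
    """Sort key for similarity-based crawling: promising pages (query
    word in anchor text or URL, or linked from a HOT parent) come
    first; within each group, fall back to the base ordering metric
    (e.g. PageRank), higher scores first."""
    promising = (
        query in candidate["anchor_text"].lower()
        or query in candidate["url"].lower()
        or candidate["parent"] in hot_parents
    )
    return (not promising, -candidate["base_score"])

frontier = [
    {"url": "http://x/a.html", "anchor_text": "news", "parent": "p1", "base_score": 0.9},
    {"url": "http://x/admission.html", "anchor_text": "apply", "parent": "p2", "base_score": 0.1},
]
frontier.sort(key=lambda c: modified_order_key(c, "admission", {"p3"}))
print([c["url"] for c in frontier])  # the promising "admission" page comes first
```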

[Figure: crawl performance for the similarity-based metric with modified ordering metrics, topic word “admission”. X-axis: % crawled pages (e.g. time); Y-axis: % crawled HOT pages. One curve per ordering metric: PageRank, backlink, breadth, random.]

Conclusion

PageRank is generally good as an ordering metric.

By applying a good ordering metric, it is possible to gather important pages quickly.

Future work

Limited buffer crawling model

Replicated page detection

Consistency maintenance

Problem

In what order should a crawler visit web pages to get the pages we want?

How can we get important pages first?

WebBase

System for creating and maintaining a large local repository of web pages

High indexing speed (50 pages/sec) and large repository (150 GB)

Load balancing scheme to prevent servers from crashing

Virtual Web crawler

The crawler for the experiments: runs on top of the WebBase repository

No load balancing

Dataset restricted to the Stanford domain

Available Information

Anchor text

URL of the page

The content of the page pointing to it
