
Page 1:

Web Search – Summer Term 2006

IV. Web Search - Crawling (part 2)

(c) Wolfgang Hürst, Albert-Ludwigs-University

Pages 2–6:

Crawling - Recap from last time

General procedure: Continuously process a list of URLs and collect the corresponding web pages and the links found along the way

Two problems: Size and frequent changes

Page selection:

Based on metrics:

- Importance Metric (goal)

- Ordering Metric (selection)

- Quality Metric (evaluation)

Experimental verification with a representative test collection


Page refresh:

Estimating the rate of change: see last lecture (note: other studies exist, e.g. [5])

Observations:
- Frequent changes
- Significant differences, e.g. among domains

Hence: an update rule is necessary

Page 7:

3. Page Refresh (Update Rules)

Problem: The web is continuously changing
Goal: Index and update pages in a way that keeps the index as fresh and as young as possible (given the limited resources)

Distinguish between

Periodic crawlers: Download K pages and stop; repeat after some time t and replace the old collection with the new one

Incremental crawlers: Continuously crawl the web and incrementally update your collection

Page 8:

3.2 Incremental Crawlers

Freshness of a page pi at time t

Freshness of a local collection P at time t
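Formally (following the definitions in [6]; N is the number of pages in the local collection P):

$$ F(p_i; t) = \begin{cases} 1 & \text{if } p_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases} \qquad F(P; t) = \frac{1}{N} \sum_{i=1}^{N} F(p_i; t) $$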

Main Goal: Keep local collection up-to-date

Two measures: Freshness and Age

Page 9:

3.2 Incremental Crawlers

Age of a page pi at time t

Age of a local collection P at time t
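Formally (again following [6]):

$$ A(p_i; t) = \begin{cases} 0 & \text{if } p_i \text{ is up-to-date at time } t \\ t - t_{\text{mod}}(p_i) & \text{otherwise} \end{cases} \qquad A(P; t) = \frac{1}{N} \sum_{i=1}^{N} A(p_i; t) $$

where $t_{\text{mod}}(p_i)$ is the time of the first modification of the real page after the last synchronization of $p_i$.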

Main Goal: Keep local collection up-to-date

Two measures: Freshness and Age

Page 10:

3.2 Incremental Crawlers

Time average of freshness of page pi at t

Time average of freshness of a local collection P at time t

(Time average of age: analogous)
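Written out, following [6]:

$$ \bar{F}(p_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(p_i; \tau) \, d\tau \qquad \bar{F}(P) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(P; \tau) \, d\tau $$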

Main Goal: Keep local collection up-to-date

Two measures: Freshness and Age

Page 11:

Example of Freshness and Age

[Figure: freshness (1 or 0) and age of a single element over time. Freshness drops from 1 to 0 when the element is changed and returns to 1 when it is synchronized; age is 0 while the element is fresh and grows from the moment of change until synchronization. Source: [6]]
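The same behaviour as a small Python computation (the change and synchronization times are made-up illustrative values, not taken from the figure):

# Freshness and age of a single element over time (toy example).
change_times = [2.0, 7.0]   # when the real element changes (hypothetical)
sync_times = [5.0, 9.0]     # when the crawler re-downloads it (hypothetical)

def freshness_and_age(t):
    last_sync = max((s for s in sync_times if s <= t), default=0.0)
    # changes of the real element that the local copy has not picked up yet
    unseen = [c for c in change_times if last_sync < c <= t]
    if not unseen:
        return 1, 0.0                 # up to date: freshness 1, age 0
    return 0, t - min(unseen)         # stale: age = time since first unseen change

for t in [1, 3, 6, 8, 10]:
    f, a = freshness_and_age(t)
    print(f"t={t}: freshness={f}, age={a}")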

Page 12:

Design alternative 1: Batch mode vs. steady crawler

Batch-mode crawler: Periodic update of all pages of the collection

Steady crawler: Continuous update

[Figure: freshness over time (months) for a batch-mode crawler vs. a steady crawler.]

Note: Assuming that page changes follow a Poisson process, one can prove that the average freshness over time is identical in both cases (for the same average crawling speed!)
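To make this plausible: for a single page with Poisson change rate $\lambda$ that is re-downloaded every $I$ time units, the time-averaged freshness is

$$ \bar{F}(\lambda, I) = \frac{1}{I} \int_0^I e^{-\lambda s} \, ds = \frac{1 - e^{-\lambda I}}{\lambda I} $$

(the probability that no change has occurred $s$ time units after a download is $e^{-\lambda s}$). The value depends only on the length of the revisit interval, which is why schedules with the same average crawling speed end up with the same average freshness; see [6] for the full proof.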

Page 13:

Design alternative 2: In-place vs. shadowing

Replace the old version of a page with the new one either in-place or via shadowing, i.e. only after all pages of one crawl have been downloaded

Shadowing keeps two collections: the crawler's collection and the current collection

[Figure: shadowing shown for a batch-mode crawler and a steady crawler.]
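The difference can be sketched in a few lines of Python (a toy structure for illustration, not the crawler's actual data model):

# In-place vs. shadowing, reduced to dictionary updates.
current = {}   # the collection users query
shadow = {}    # the collection the running crawl is filling

def record_page(url, page, shadowing=True):
    if shadowing:
        shadow[url] = page      # invisible to users until the crawl finishes
    else:
        current[url] = page     # in-place: the new version is visible immediately

def finish_crawl():
    # Shadowing only: swap in the freshly crawled collection in one step.
    global current, shadow
    current, shadow = shadow, {}

record_page("http://example.org/a", "<html>new version</html>")
finish_crawl()
print(sorted(current))          # the new collection is now being served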

Page 14:

Design alternative 3: Fixed vs. variable frequency

Fixed frequency / uniform refresh policy: Same access rate to all pages (independent of their actual rate of change)

Variable frequency: Access pages depending on their rate of change

Example: Proportional refresh policy, i.e. revisit frequency proportional to a page's rate of change
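In symbols, with $f_i$ the revisit frequency of page $p_i$, $\lambda_i$ its rate of change, and $c$ a constant fixed by the crawl budget:

$$ \text{uniform: } f_i = c \qquad\qquad \text{proportional: } f_i = c \cdot \lambda_i $$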

Page 15:

Variable frequency update

Obvious assumption for a good strategy: Visit pages that change frequently more often

Wrong!!!

The optimum update strategy (if we assume Poisson-distributed changes) looks like this:

[Figure: optimum update time plotted against the rate of change of a page (x-axis: rate of change of a page; y-axis: optimum update time).]
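Behind the curve is the following resource-allocation problem (as formulated in [6]; $f$ is the average revisit frequency the crawler can afford):

$$ \max_{f_1, \dots, f_N} \; \frac{1}{N} \sum_{i=1}^{N} \bar{F}(\lambda_i, f_i) \quad \text{s.t.} \quad \frac{1}{N} \sum_{i=1}^{N} f_i = f, \quad f_i \ge 0, $$

with $\bar{F}(\lambda_i, f_i) = \frac{f_i}{\lambda_i}\left(1 - e^{-\lambda_i / f_i}\right)$ as derived above.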

Page 16:

Variable frequency update (cont.)

Why is this a better strategy?

Illustration with a simple example:

[Figure: a simple two-page example with pages P1 and P2, one changing much more often than the other.]

Intuition: if a page changes more often than we can possibly revisit it, each download keeps its local copy fresh only for a short moment, whereas the same download spent on a slowly changing page keeps that copy fresh much longer. Shifting visits from the fastest-changing pages to slower ones therefore increases the expected freshness of the collection.
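A numeric sanity check of this intuition, reusing the time-averaged freshness formula from above (the change rates and the crawl budget are made-up illustrative values):

import math

def avg_freshness(rate, visits):
    # Time-averaged freshness of a page with Poisson change rate `rate`
    # (changes/day), revisited `visits` times per day: (f/l) * (1 - exp(-l/f)).
    return (visits / rate) * (1 - math.exp(-rate / visits))

# Hypothetical numbers: P1 changes 9x/day, P2 changes 1x/day,
# and the crawler can afford 10 downloads per day in total.
rates = {"P1": 9.0, "P2": 1.0}
allocations = {
    "proportional (9 + 1 visits/day)": {"P1": 9.0, "P2": 1.0},
    "uniform      (5 + 5 visits/day)": {"P1": 5.0, "P2": 5.0},
}

for name, alloc in allocations.items():
    f = sum(avg_freshness(rates[p], alloc[p]) for p in rates) / len(rates)
    print(f"{name}: average freshness = {f:.3f}")

Even the naive uniform policy (average freshness ≈ 0.685) beats the proportional one (≈ 0.632) here; the optimum strategy from the previous slide does better still.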

Page 17:

Summary of different design alternatives

Steady vs. batch-mode

In-place update vs. shadowing

Variable frequency vs. fixed frequency

Page 18:

3.3 Example of an Incremental Crawler

Two main goals:

- Keep the local collection fresh: regular, best-possible updates of the pages in the index

- Continuously improve the quality of the collection: replace existing low-quality pages with new, higher-quality ones

Page 19:

3.3 Example of an Incremental Crawler

WHILE (TRUE)
    URL = SELECT_TO_CRAWL (ALL_URLS);
    PAGE = CRAWL (URL);
    IF (URL IN COLL_URLS) THEN
        UPDATE (URL, PAGE);
    ELSE
        TMP_URL = SELECT_TO_DISCARD (COLL_URLS);
        DISCARD (TMP_URL);
        SAVE (URL, PAGE);
        COLL_URLS = (COLL_URLS - {TMP_URL}) U {URL};
    NEW_URLS = EXTRACT_URLS (PAGE);
    ALL_URLS = ALL_URLS U NEW_URLS;
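The same loop as a minimal runnable Python sketch. The selection and discard policies are deliberately naive stand-ins (random choice), and `fetch` is stubbed out so the example is self-contained:

import random

def fetch(url):
    # Stub for an HTTP fetch; returns fake page text and no out-links.
    return f"<html>content of {url}</html>", []

def incremental_crawl(seed_urls, max_collection_size, steps=100):
    all_urls = set(seed_urls)   # every URL the crawler knows about (ALL_URLS)
    coll_urls = set()           # URLs currently in the collection (COLL_URLS)
    collection = {}             # url -> page text

    for _ in range(steps):
        url = random.choice(sorted(all_urls))        # SELECT_TO_CRAWL (naive)
        page, new_urls = fetch(url)                  # CRAWL

        if url in coll_urls:
            collection[url] = page                   # UPDATE in place
        else:
            # Discard only when the collection is full (the slide's pseudocode
            # assumes a permanently full collection).
            if len(coll_urls) >= max_collection_size:
                victim = random.choice(sorted(coll_urls))  # SELECT_TO_DISCARD (naive)
                coll_urls.remove(victim)             # DISCARD
                del collection[victim]
            collection[url] = page                   # SAVE
            coll_urls.add(url)

        all_urls |= set(new_urls)                    # EXTRACT_URLS, then union

    return collection

pages = incremental_crawl(["http://example.org/"], max_collection_size=10)
print(len(pages), "pages in collection")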

Page 20:

3.3 Example of an Incremental Crawler

[Architecture diagram: three modules around the COLLECTION and the URL lists ALL_URLS and COLL_URLS. The RANKING MODULE scans both lists and adds/removes collection URLs (discarding low-ranked pages); the UPDATE MODULE pops URLs to refresh and pushes them back onto the list; the CRAWL MODULE crawls pages (using checksums to detect changes), updates/saves them in the collection, and adds newly discovered URLs (ADD_URLS).]

Page 21:

References - Web Crawler

[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1(1), Aug. 2001. Chapter 2 (Crawling web pages).

[2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998. Chapter 4.3 (Crawling the web).

[3] J. Cho, H. Garcia-Molina, L. Page: "Efficient Crawling Through URL Ordering", WWW 1998.

[4] J. Cho, H. Garcia-Molina: "The Evolution of the Web and Implications for an Incremental Crawler", Proceedings of the 26th Intl. Conf. on Very Large Data Bases (VLDB 2000).

[5] D. Fetterly, M. Manasse, M. Najork, J. Wiener: "A Large-Scale Study of the Evolution of Web Pages", WWW 2003.

[6] J. Cho, H. Garcia-Molina: "Synchronizing a Database to Improve Freshness", ACM SIGMOD 2000.