
Page 1: CS 430 / INFO 430:  Information Discovery

1

CS 430 / INFO 430: Information Discovery

Lecture 20

Web Search 2

Page 2: CS 430 / INFO 430:  Information Discovery

2

Course Administration

Page 3: CS 430 / INFO 430:  Information Discovery

3


Page 4: CS 430 / INFO 430:  Information Discovery

4

Example: Heritrix Crawler

A high-performance, open source crawler for production and research

Developed by the Internet Archive and others

Before Heritrix, Cornell Computer Science used the Mercator web crawler for experiments in selective web crawling (automated collection development). Mercator was developed by Allan Heydon, Marc Najork, and colleagues at the Compaq Systems Research Center. This was a continuation of the work of Digital's AltaVista group.

Page 5: CS 430 / INFO 430:  Information Discovery

5

Heritrix: Design Goals

Broad crawling: Large, high-bandwidth crawls to sample as much of the web as possible given the time, bandwidth, and storage resources available.

Focused crawling: Small- to medium-sized crawls (usually less than 10 million unique documents) in which the quality criterion is complete coverage of selected sites or topics.

Continuous crawling: Crawls that revisit previously fetched pages, looking for changes and new pages, even adapting its crawl rate based on parameters and estimated change frequencies.

Experimental crawling: Experiment with crawling techniques, such as the choice of what to crawl, the order in which URIs are crawled, crawling using diverse protocols, and the analysis and archiving of crawl results.

Page 6: CS 430 / INFO 430:  Information Discovery

6

Heritrix

Design parameters

• Extensible. Many components are plugins that can be rewritten for different tasks.

• Distributed. A crawl can be distributed in a symmetric fashion across many machines.

• Scalable. The size of in-memory data structures is bounded.

• High performance. Performance is limited by the speed of the Internet connection (e.g., with a 160 Mbit/sec connection, it downloads 50 million documents per day).

• Polite. Options of weak or strong politeness.

• Continuous. Will support continuous crawling.

Page 7: CS 430 / INFO 430:  Information Discovery

7

Heritrix: Main Components

Scope: Determines what URIs are ruled into or out of a certain crawl. Includes the seed URIs used to start a crawl, plus the rules to determine which discovered URIs are also to be scheduled for download.

Frontier: Tracks which URIs are scheduled to be collected, and those that have already been collected. It is responsible for selecting the next URI to be tried, and prevents the redundant rescheduling of already-scheduled URIs.

Processor Chains: Modular Processors that perform specific, ordered actions on each URI in turn. These include fetching the URI, analyzing the returned results, and passing discovered URIs back to the Frontier.
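To make the processor-chain idea concrete, here is a toy sketch in Python (my own illustration, not Heritrix's actual API or class names): each URI taken from the frontier is passed through an ordered sequence of processing steps, and newly discovered URIs are fed back to the frontier.

```python
# Toy processor chain in the spirit of the description above (not Heritrix's real API).
# The page contents and discovered links are hard-coded placeholders.

def fetch(uri, state):
    # A real processor would download the URI over HTTP; here we fake a body.
    state["body"] = "<html>... contents of " + uri + " ...</html>"

def extract_links(uri, state):
    # A real processor would parse state["body"]; here every page "discovers"
    # the same two URIs, so the toy crawl terminates quickly.
    state["discovered"] = ["http://www.example.org/a.html",
                           "http://www.example.org/b.html"]

frontier = ["http://www.example.org/"]    # seed URI (in scope)
seen = set()

while frontier:
    uri = frontier.pop(0)
    if uri in seen:
        continue                          # the Frontier avoids redundant rescheduling
    seen.add(uri)
    state = {}
    fetch(uri, state)                     # ordered processors applied to each URI
    extract_links(uri, state)
    frontier.extend(state["discovered"])  # discovered URIs go back to the frontier

print(len(seen), "URIs processed")        # 3
```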

Page 8: CS 430 / INFO 430:  Information Discovery

8

Mercator: Main Components

• Crawling is carried out by multiple worker threads, e.g., 500 threads for a big crawl.

• The URL frontier stores the list of absolute URLs to download.

• The DNS resolver resolves domain names into IP addresses.

• Protocol modules download documents using the appropriate protocol (e.g., HTTP).

• Link extractor extracts URLs from pages and converts to absolute URLs.

• URL filter and duplicate URL eliminator determine which URLs to add to frontier.

Page 9: CS 430 / INFO 430:  Information Discovery

9

Building a Web Crawler: Links are not Easy to Extract

• Relative/absolute URLs

• CGI
– Parameters
– Dynamic generation of pages

• Server-side scripting

• Server-side image maps

• Links buried in scripting code
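As a sketch of the relative/absolute issue only (my own example; the base page and hrefs are made up), Python's standard library can resolve relative links against the page they were found on:

```python
from urllib.parse import urljoin, urldefrag

# Hypothetical page and some raw href values extracted from its HTML.
base_url = "http://www.example.edu/dept/index.html"
raw_hrefs = [
    "papers/crawl.pdf",           # relative to the current directory
    "/about/",                    # relative to the site root
    "../index.html#staff",        # relative, with a fragment
    "http://other.example.org/",  # already absolute
]

absolute = []
for href in raw_hrefs:
    url = urljoin(base_url, href)   # resolve against the base page
    url, _ = urldefrag(url)         # drop the #fragment; it is not a separate page
    absolute.append(url)

print(absolute)
```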

Page 10: CS 430 / INFO 430:  Information Discovery

10

Mercator: The URL Frontier

A repository with two pluggable methods: add a URL, get a URL.

Most web crawlers use variations of breadth-first traversal, but ...

• Most URLs on a web page are relative (about 80%).

• A single FIFO queue, serving many threads, would send many simultaneous requests to a single server.

Weak politeness guarantee: Only one thread allowed to contact a particular web server.

Stronger politeness guarantee: Maintain n FIFO queues, each for a single host, which feed the queues for the crawling threads by rules based on priority and politeness factors.
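Below is a minimal sketch of the per-host queue idea (my own illustration, far simpler than Mercator's frontier and not fully thread-safe): URLs are grouped into one FIFO queue per host, and a host is handed out to at most one worker at a time.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

class ToyFrontier:
    """One FIFO queue per host; a host is dispensed to one worker at a time."""

    def __init__(self):
        self.per_host = defaultdict(deque)   # host -> FIFO of URLs for that host
        self.ready = deque()                 # hosts with pending URLs, round-robin

    def add(self, url):
        host = urlparse(url).netloc
        if not self.per_host[host]:
            self.ready.append(host)          # host becomes schedulable
        self.per_host[host].append(url)

    def get(self):
        # A politer version would also wait between requests to the same host.
        host = self.ready.popleft()
        url = self.per_host[host].popleft()
        if self.per_host[host]:
            self.ready.append(host)          # re-queue host only after this URL
        return url

f = ToyFrontier()
for u in ["http://a.example.org/1", "http://a.example.org/2", "http://b.example.org/1"]:
    f.add(u)
print(f.get(), f.get(), f.get())             # alternates hosts: a, b, a
```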

Page 11: CS 430 / INFO 430:  Information Discovery

11

Mercator: Duplicate URL Elimination

Duplicate URLs are not added to the URL Frontier

Requires an efficient data structure to store all URLs that have been seen and to check whether a new URL is among them.

In memory:

Represent URL by 8-byte checksum. Maintain in-memory hash table of URLs.

Requires 5 Gigabytes for 1 billion URLs.

Disk based:

Combination of disk file and in-memory cache with batch updating to minimize disk head movement.
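A sketch of the in-memory variant described above (my own code; Mercator used its own checksum function, and here the first 8 bytes of SHA-1 stand in for it):

```python
import hashlib

class SeenURLs:
    """Duplicate-URL eliminator: keep an in-memory set of 8-byte fingerprints."""

    def __init__(self):
        self.fingerprints = set()

    @staticmethod
    def checksum(url):
        # 8-byte fingerprint of the URL (SHA-1 prefix as a stand-in checksum).
        return hashlib.sha1(url.encode("utf-8")).digest()[:8]

    def add_if_new(self, url):
        fp = self.checksum(url)
        if fp in self.fingerprints:
            return False                     # duplicate: do not add to the frontier
        self.fingerprints.add(fp)
        return True

seen = SeenURLs()
print(seen.add_if_new("http://www.example.edu/"))   # True  (new URL)
print(seen.add_if_new("http://www.example.edu/"))   # False (already seen)
```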

Page 12: CS 430 / INFO 430:  Information Discovery

12

Mercator: Domain Name Lookup

Resolving domain names to IP addresses is a major bottleneck of web crawlers.

Approach:

• Separate DNS resolver and cache on each crawling computer.

• Create multi-threaded version of DNS code (BIND).

These changes reduced DNS lookup from 70% to 14% of each thread's elapsed time.
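The caching half of this can be sketched as follows (my own illustration, not Mercator's resolver; it uses the operating system's blocking lookup on a cache miss and never expires entries):

```python
import socket
import threading

class CachingResolver:
    """Per-crawler DNS cache shared by many worker threads (illustrative only)."""

    def __init__(self):
        self.cache = {}                      # hostname -> IP address
        self.lock = threading.Lock()

    def resolve(self, hostname):
        with self.lock:
            if hostname in self.cache:
                return self.cache[hostname]  # hit: no network round trip
        ip = socket.gethostbyname(hostname)  # miss: blocking OS/DNS lookup
        with self.lock:
            self.cache[hostname] = ip
        return ip

# resolver = CachingResolver()
# print(resolver.resolve("www.example.org"))   # requires network access
```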

Page 13: CS 430 / INFO 430:  Information Discovery

13


Page 14: CS 430 / INFO 430:  Information Discovery

14

Research Topics in Web Crawling

• How frequently to crawl and what strategies to use.

• Identification of anomalies and crawling traps.

• Strategies for crawling based on the content of web pages (focused and selective crawling).

• Duplicate detection.

Page 15: CS 430 / INFO 430:  Information Discovery

15

Further Reading

Heritrix: http://crawler.archive.org/

Allan Heydon and Marc Najork, Mercator: A Scalable, Extensible Web Crawler. Compaq Systems Research Center, June 26, 1999. http://www.research.compaq.com/SRC/mercator/papers/www/paper.html

Page 16: CS 430 / INFO 430:  Information Discovery

16

Indexing the Web

Goals: Precision

Short queries applied to very large numbers of items lead to large numbers of hits.

• Goal is that the first 10-100 hits presented should satisfy the user's information need

-- requires ranking hits in an order that fits the user's requirements

• Recall is not an important criterion

Completeness of index is not an important factor.

• Comprehensive crawling is unnecessary

Page 17: CS 430 / INFO 430:  Information Discovery

17

Concept of Relevance

Document measures

Relevance, as conventionally defined, is binary (relevant or not relevant). It is usually estimated by the similarity between the terms in the query and each document.

Importance measures documents by their likelihood of being useful to a variety of users. It is usually estimated by some measure of popularity.

Web search engines rank documents by a combination of relevance and importance. The goal is to present the user with the most important of the relevant documents.

Page 18: CS 430 / INFO 430:  Information Discovery

18

Ranking Options

1. Paid advertisers

2. Manually created classification

3. Vector space ranking with corrections for document length

4. Extra weighting for specific fields, e.g., title, anchors, etc.

5. Popularity, e.g., PageRank

The balance between 3, 4, and 5 is not made public.

Page 19: CS 430 / INFO 430:  Information Discovery

19

Bibliometrics

Techniques that use citation analysis to measure the similarity of journal articles or their importance

Bibliographic coupling: two papers that cite many of the same papers

Co-citation: two papers that were cited by many of the same papers

Impact factor (of a journal): frequency with which the average article in a journal has been cited in a particular year or period

Page 20: CS 430 / INFO 430:  Information Discovery

20

Citation Graph

[Diagram: a paper in the citation graph, with "cites" links pointing to earlier papers and "is cited by" links coming from later papers.]

Note that journal citations always refer to earlier work.

Page 21: CS 430 / INFO 430:  Information Discovery

21

Graphical Analysis of Hyperlinks on the Web

This page links to many other pages (hub)

Many pages link to this page (authority)

[Diagram: a web graph with six pages (1 to 6) illustrating a hub and an authority.]

Page 22: CS 430 / INFO 430:  Information Discovery

22

Cornell Note

Jon Kleinberg of Cornell Computer Science has carried out extensive research in this area, both theoretical work and the practical development of new algorithms. In particular, he has studied hubs (documents that refer to many others) and authorities (documents that are referenced by many others).

This is one of the topics covered in CS/INFO 685, The Structure of Information Networks.

Page 23: CS 430 / INFO 430:  Information Discovery

23

PageRank Algorithm

Used to estimate importance of documents.

Concept:

The rank of a web page is higher if many pages link to it.

Links from highly ranked pages are given greater weight than links from less highly ranked pages.

Page 24: CS 430 / INFO 430:  Information Discovery

24

Intuitive Model (Basic Concept)

Basic (no damping)

A user:

1. Starts at a random page on the web

2. Selects a random hyperlink from the current page and jumps to the corresponding page

3. Repeats Step 2 a very large number of times

Pages are ranked according to the relative frequency with which they are visited.
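This random-surfer model can be simulated directly. The sketch below (my own code, with a made-up three-page web) counts how often each page is visited during a long random walk; the relative visit frequencies are the page ranks in this basic, no-damping model.

```python
import random

# Hypothetical tiny web: page -> list of pages it links to.
links = {1: [2, 3], 2: [3], 3: [1]}

visits = {page: 0 for page in links}
page = random.choice(list(links))            # 1. start at a random page
for _ in range(100_000):                     # 3. repeat a very large number of times
    page = random.choice(links[page])        # 2. follow a random hyperlink
    visits[page] += 1

total = sum(visits.values())
print({p: round(count / total, 3) for p, count in visits.items()})
```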

Page 25: CS 430 / INFO 430:  Information Discovery

25

Matrix Representation

                      Citing page (from)
                   P1   P2   P3   P4   P5   P6   Number
Cited      P1       .    .    .    .    1    .      1
page       P2       1    .    1    .    .    .      2
(to)       P3       1    1    .    1    .    .      3
           P4       1    1    .    .    1    1      4
           P5       1    .    .    .    .    .      1
           P6       .    .    .    .    1    .      1
           Number   4    2    1    1    3    1

(A dot marks an empty cell; a 1 in row Pi, column Pj means that page Pj links to page Pi. The Number column counts links into each page; the Number row counts links out of each page.)

Page 26: CS 430 / INFO 430:  Information Discovery

26

Basic Algorithm: Normalize by Number of Links from Page

                      Citing page
                   P1    P2    P3    P4    P5    P6
Cited      P1       .     .     .     .   0.33    .
page       P2      0.25   .     1     .    .      .
(to)       P3      0.25  0.5    .     1    .      .
           P4      0.25  0.5    .     .   0.33    1
           P5      0.25   .     .     .    .      .
           P6       .     .     .     .   0.33    .

Number of links out of each page:  4   2   1   1   3   1

Normalized link matrix = B (each column of the link matrix is divided by the number of links out of the citing page).
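The normalization can be reproduced with a few lines of NumPy (my own sketch, encoding the link structure implied by the two matrices above): divide each column of the link matrix by the number of links out of the corresponding citing page.

```python
import numpy as np

# Link matrix A for pages P1..P6: A[i, j] = 1 if page j links to page i.
A = np.array([
    [0, 0, 0, 0, 1, 0],   # P1 is linked to by P5
    [1, 0, 1, 0, 0, 0],   # P2 is linked to by P1, P3
    [1, 1, 0, 1, 0, 0],   # P3 is linked to by P1, P2, P4
    [1, 1, 0, 0, 1, 1],   # P4 is linked to by P1, P2, P5, P6
    [1, 0, 0, 0, 0, 0],   # P5 is linked to by P1
    [0, 0, 0, 0, 1, 0],   # P6 is linked to by P5
], dtype=float)

links_out = A.sum(axis=0)    # links out of each page: [4. 2. 1. 1. 3. 1.]
B = A / links_out            # divide every column by its page's out-link count
print(np.round(B, 2))        # the normalized link matrix B shown above
```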

Page 27: CS 430 / INFO 430:  Information Discovery

27

Basic Algorithm: Weighting of Pages

Initially all pages have weight 1

w0 = (1, 1, 1, 1, 1, 1)

Recalculate weights:

w1 = Bw0 = (0.33, 1.25, 1.75, 2.08, 0.25, 0.33)

If the user starts at a page chosen at random, the jth element of w1 is proportional to the probability of being at page j after one step.

Page 28: CS 430 / INFO 430:  Information Discovery

28

Basic Algorithm: Iterate

Iterate: wk = Bwk-1

        w0     w1     w2     w3    ...converges to...     w
P1       1    0.33   0.08   0.03                         0.00
P2       1    1.25   1.83   2.80                         2.39
P3       1    1.75   2.79   2.06                         2.39
P4       1    2.08   1.12   1.05                         1.19
P5       1    0.25   0.08   0.02                         0.00
P6       1    0.33   0.08   0.03                         0.00
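A short sketch of the same iteration in code (my own; B is the normalized link matrix from the earlier slide). Running it reproduces the convergence shown above, with all of the weight draining into pages 2, 3, and 4.

```python
import numpy as np

# Normalized link matrix B from the earlier slide (columns = citing pages P1..P6).
B = np.array([
    [0,    0,   0, 0, 1/3, 0],
    [0.25, 0,   1, 0, 0,   0],
    [0.25, 0.5, 0, 1, 0,   0],
    [0.25, 0.5, 0, 0, 1/3, 1],
    [0.25, 0,   0, 0, 0,   0],
    [0,    0,   0, 0, 1/3, 0],
])

w = np.ones(6)               # w0: every page starts with weight 1
for _ in range(100):
    w = B @ w                # wk = B wk-1
print(np.round(w, 2))        # close to the converged w above: (0, 2.4, 2.4, 1.2, 0, 0)
```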

Page 29: CS 430 / INFO 430:  Information Discovery

29

Graphical Analysis of Hyperlinks on the Web

There is no link out of {2, 3, 4}.

[Diagram: the six-page web graph from the earlier slide; pages 2, 3, and 4 link only among themselves, so there is no link out of the set {2, 3, 4}.]

Page 30: CS 430 / INFO 430:  Information Discovery

30

Google PageRank with Damping

A user:

1. Starts at a random page on the web

2a. With probability d, selects any random page and jumps to it

2b. With probability 1-d, selects a random hyperlink from the current page and jumps to the corresponding page

3. Repeats Step 2a and 2b a very large number of times

Pages are ranked according to the relative frequency with which they are visited.

Page 31: CS 430 / INFO 430:  Information Discovery

31

The PageRank Iteration

The basic method iterates using the normalized link matrix, B.

wk = Bwk-1

This w is the eigenvector of B corresponding to its highest eigenvalue (the principal eigenvector of B).

PageRank introduces a damping factor d. The method iterates:

wk = dw0 + (1 - d)Bwk-1

w0 is a vector with every element equal to 1. d is a constant found by experiment.

Page 32: CS 430 / INFO 430:  Information Discovery

32

Iterate with Damping

Iterate: wk = dw0 + (1 - d)Bwk-1 (d = 0.3)

        w0     w1     w2     w3    ...converges to...     w
P1       1    0.53   0.41   0.39                         0.38
P2       1    1.18   1.46   1.80                         1.68
P3       1    1.53   2.03   1.78                         1.87
P4       1    1.76   1.29   1.26                         1.31
P5       1    0.48   0.39   0.37                         0.37
P6       1    0.53   0.41   0.39                         0.38
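The damped version changes only the update step (again my own sketch, reusing the same B as in the previous sketch; with d = 0.3 the weights settle close to the converged column shown above):

```python
import numpy as np

# Same normalized link matrix B as in the previous sketch (pages P1..P6).
B = np.array([
    [0,    0,   0, 0, 1/3, 0],
    [0.25, 0,   1, 0, 0,   0],
    [0.25, 0.5, 0, 1, 0,   0],
    [0.25, 0.5, 0, 0, 1/3, 1],
    [0.25, 0,   0, 0, 0,   0],
    [0,    0,   0, 0, 1/3, 0],
])

d = 0.3
w0 = np.ones(6)
w = w0.copy()
for _ in range(100):
    w = d * w0 + (1 - d) * (B @ w)   # wk = d w0 + (1 - d) B wk-1
print(np.round(w, 2))                # close to the converged w above (differences are rounding)
```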

Page 33: CS 430 / INFO 430:  Information Discovery

33

Google: PageRank

The Google PageRank algorithm is usually written with the following notation

If page A has pages T1, ..., Tn pointing to it:
– d: damping factor
– C(A): number of links out of A

Iterate until convergence:

P(A) = (1 - d) + d ( P(T1)/C(T1) + ... + P(Tn)/C(Tn) )
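Written directly from this formula, a sketch over an adjacency list looks like the following (my own illustration; the pages, the value of d, and the fixed iteration count are made up). Note that in this notation d multiplies the link term, the opposite of the role d played on the damping slides above.

```python
# Hypothetical link structure: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

d = 0.85                                  # damping factor (a commonly quoted value)
pages = list(links)
pr = {p: 1.0 for p in pages}              # initial PageRank P(A) = 1 for every page

# Invert the graph once: which pages Ti point to each page A?
pointing_to = {a: [t for t in pages if a in links[t]] for a in pages}

for _ in range(50):                       # iterate until (approximately) converged
    pr = {
        a: (1 - d) + d * sum(pr[t] / len(links[t]) for t in pointing_to[a])
        for a in pages
    }

print({p: round(v, 3) for p, v in pr.items()})
```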

Page 34: CS 430 / INFO 430:  Information Discovery

34

Information Retrieval Using PageRank

Simple Method

Consider all hits (i.e., all document vectors that share at least one term with the query vector) as equal.

Display the hits ranked by PageRank.

The disadvantage of this method is that it gives no attention to how closely a document matches the query.

Page 35: CS 430 / INFO 430:  Information Discovery

35

Combining Term Weighting with Reference Pattern Ranking

Combined Method

1. Find all documents that share a term with the query vector.

2. The similarity, using conventional term weighting, between the query and document j is sj.

3. The rank of document j using PageRank or other reference pattern ranking is pj.

4. Calculate a combined rank cj = λsj + (1 - λ)pj, where λ is a constant.

5. Display the hits ranked by cj.

This method is used in several commercial systems, but the details have not been published.
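A minimal sketch of step 4 (my own illustration: the scores, the value of λ, and the assumption that sj and pj are already scaled to comparable ranges are all made up, since as noted the real details are not published):

```python
# Hypothetical scores for three hits on one query, each already scaled to [0, 1].
similarity = {"doc1": 0.82, "doc2": 0.40, "doc3": 0.65}   # s_j: term-weighting similarity
importance = {"doc1": 0.10, "doc2": 0.95, "doc3": 0.50}   # p_j: PageRank-style importance

lam = 0.6   # the constant lambda: weight given to query similarity vs. importance

combined = {d: lam * similarity[d] + (1 - lam) * importance[d] for d in similarity}

# Display the hits ranked by the combined score c_j.
for doc, score in sorted(combined.items(), key=lambda item: item[1], reverse=True):
    print(doc, round(score, 3))
```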