
CS 430 / INFO 430 Information Retrieval



Lecture 4

Searching Full Text 4


Course Administration

Assignment 1 has been posted. It is a programming assignment and is due on Sunday, September 17 at 11 p.m.

Follow the instructions carefully.

Send questions to [email protected].

This is a preliminary statement of the assignment. Watch the Web site for any minor changes.


Inverted File

Inverted file:

An inverted file is a list of search terms organized for associative look-up, i.e., to answer the questions:

• In which documents does a specified search term appear?

• Where within each document does each term appear? (There may be several occurrences.)

In a free text search system, the word list and the postings file together provide an inverted file system. In addition, they contain the data needed to calculate weights and information that is used to display results.


Inverted File -- Basic Concept

Word     Documents
abacus   3, 19, 22
actor    2, 19, 29
aspen    5
atoll    11, 34

This is called an index file, a word list, or a vocabulary file.

Stop words are removed before building the index.


Inverted List -- Definitions

Posting: Entry in an inverted file system that applies to a single instance of a term within a document, e.g., there are three postings for "abacus":

abacus 3

abacus 19

abacus 22

Inverted List: A list of all the postings in an inverted file system that apply to a specific word, e.g.

abacus 3 19 22


Organization of Files for Full Text Searching

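[Figure: the word list (index file) holds the terms ant, bee, cat, dog, elk, fox, gnu, hog, each with a pointer to its postings. The pointers lead to the inverted lists in the postings file, which in turn refer to the documents store.]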


Representation of Inverted Files

Document store: Stores the documents. Important for user interface design. [Repositories for the storage of document collections are covered in CS 431.]

Word list (vocabulary file): Stores the list of terms (keywords). Designed for searching and for sequential processing in lexicographic order, e.g., for range queries. May be held in memory.

Postings file: Stores an inverted list (postings list) of postings for each term. Designed for rapid merging of lists and calculation of similarities. Each list is usually stored sequentially. Can be very large.


Document Store

The Documents Store holds the corpus that is being indexed. The corpus may be:

• primary documents, e.g., electronic journal articles or Web pages.

• surrogates, e.g., catalog records or abstracts, which refer to the primary documents.


Document Store

The storage of the document store may be:

Central (monolithic) - all documents stored together on a single server (e.g., library catalog)

Distributed database - all documents managed together but stored on several servers (e.g., Medline, Westlaw)

Highly distributed - documents stored on independently managed servers (e.g., the Web)

Each requires: a document ID, which is a unique identifier that can be used by the search system to refer to the document, and a location counter, which can be used to specify location of words or characters within a document.


Documents Store for Web Search Systems

For Web search systems:

• A document is a Web page.

• The documents store is the Web.

• The document ID is the URL of the document.

Indexes are built using a web crawler, which retrieves each page on the Web for indexing. After indexing, the local copy of each page is discarded, unless stored in a cache.

(In addition to the usual word list and postings file the indexing system stores contextual information, which will be discussed in a later lecture.)


Use of Inverted Files for Evaluating a Boolean Query

To evaluate the and operator, merge the two inverted lists with a logical AND operation.

Example: abacus and actor

Postings for abacus: 3 19 22

Postings for actor: 2 19 29

Document 19 is the only document that contains both terms, "abacus" and "actor".
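
The AND merge can be sketched as a single pass over two sorted inverted lists. This is a minimal illustration, assuming each inverted list is a Python list of document IDs in ascending order; the function name `intersect` is hypothetical and not part of the lecture.

```python
def intersect(a, b):
    """Return the document IDs present in both sorted postings lists."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document in both lists: it satisfies the AND.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the list that is behind
        else:
            j += 1
    return result


abacus = [3, 19, 22]
actor = [2, 19, 29]
print(intersect(abacus, actor))  # [19]
```

Each list is read once from front to back, which is why the lists need to be sorted in the same sequence.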


Use of Inverted Files for Calculating Similarities

In the term vector space, if q is a query and dj a document, then q and dj have no terms in common iff q.dj = 0.

1. To calculate all the non-zero similarities find R, the set of all the documents, dj, that contain at least one term in the query:

2. Merge the inverted lists for each term ti in the query, with a logical or, to establish the set, R.

3. For each dj ∈ R, calculate Similarity(q, dj), using appropriate weights.

4. Return the elements of R in ranked order.
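
The four steps can be sketched as follows. This is an illustration under stated assumptions: the weights are made up, the similarity is taken to be the plain inner product q.dj (the slides only say "appropriate weights"), and the names `postings` and `rank` are hypothetical.

```python
# Hypothetical weighted postings: term -> {document ID: weight of the term in that document}
postings = {
    "abacus": {3: 0.5, 19: 0.25, 22: 0.375},
    "actor": {2: 0.25, 19: 0.5, 29: 0.125},
}


def rank(query_weights, postings):
    """Score every document that shares at least one term with the query."""
    scores = {}
    for term, q_w in query_weights.items():
        # Visiting each term's inverted list in turn is the logical OR:
        # a document enters `scores` (the set R) as soon as one term matches.
        for doc, d_w in postings.get(term, {}).items():
            scores[doc] = scores.get(doc, 0.0) + q_w * d_w
    # Highest similarity first; ties broken by document ID.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


ranked = rank({"abacus": 1.0, "actor": 1.0}, postings)
print([doc for doc, _ in ranked])  # [19, 3, 22, 2, 29]
```

Documents that contain none of the query terms never appear in `scores`, so their zero similarities are never computed.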


Enhancements to Inverted Files -- Concept

Location: Each posting holds information about the location of each term within the document.

Uses

user interface design -- highlight location of search term

adjacency and near operators (in Boolean searching)

Frequency: Each inverted list includes the number of postings for each term.

Uses

term weighting

query processing optimization


Inverted File -- Concept (Enhanced)

Word     Postings   Document   Location
abacus   4          3          94
                    19         7
                    19         212
                    22         56
actor    3          2          66
                    19         213
                    29         45
aspen    1          5          43
atoll    3          11         3
                    11         70
                    34         40

The three rows for actor form the inverted list for the term actor.
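
One way to see how location data supports an adjacency operator is the sketch below, which uses the document and location values from the enhanced inverted file. It assumes that locations are word positions, so that two terms are adjacent when their locations differ by one; the names `index` and `adjacent` are hypothetical.

```python
# Positional postings: term -> list of (document, location) pairs
index = {
    "abacus": [(3, 94), (19, 7), (19, 212), (22, 56)],
    "actor": [(2, 66), (19, 213), (29, 45)],
    "aspen": [(5, 43)],
    "atoll": [(11, 3), (11, 70), (34, 40)],
}


def adjacent(index, first, second):
    """Return the (document, location) of `first` wherever `second` follows it directly."""
    following = set(index.get(second, []))
    return [
        (doc, loc)
        for doc, loc in index.get(first, [])
        if (doc, loc + 1) in following
    ]


print(adjacent(index, "abacus", "actor"))  # [(19, 212)]
```

Here "abacus" at location 212 of document 19 is followed by "actor" at location 213, so document 19 matches the phrase.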


Data for Calculating Weights

The calculation of weights requires extra data to be held in the inverted file system.

For each term, tj and document, di

fij number of occurrences of tj in di

For each term, tj

nj number of documents containing tj

For each document, di

mi maximum frequency of any term in di

For the entire document file

n total number of documents
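
The four quantities can be gathered in one pass over a collection. The sketch below uses a made-up three-document corpus and hypothetical variable names; it only collects the statistics and does not commit to any particular weighting formula.

```python
from collections import Counter

# Hypothetical toy corpus: document ID -> list of terms (stop words already removed)
docs = {
    1: ["abacus", "actor", "abacus"],
    2: ["actor", "aspen"],
    3: ["abacus", "atoll", "atoll", "atoll"],
}

# f[i][j]: number of occurrences of term j in document i
f = {doc: Counter(terms) for doc, terms in docs.items()}

# n_j[j]: number of documents containing term j
n_j = Counter(term for counts in f.values() for term in counts)

# m_i[i]: maximum frequency of any term in document i
m_i = {doc: max(counts.values()) for doc, counts in f.items()}

# n: total number of documents
n_docs = len(docs)

print(f[1]["abacus"], n_j["abacus"], m_i[3], n_docs)  # 2 2 3 3
```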


Word List: Individual Records for Each Term

The record for term j in the word list contains:

term j

pointer to inverted (postings) list for term j

number of documents in which term j occurs (nj)


Decisions in Building an Inverted File System: Lexicographic Order

It is important that the word list can be processed sequentially, i.e., in alphabetic order.

• To search with wild cards, e.g. comp*, which expands to every term beginning with the letters "comp".

• To list results for browsing lists of search terms.

This is a special case of the mathematical concept of lexicographic order.
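
A wild card such as comp* can be expanded with one binary search followed by a sequential scan. This is a minimal sketch, assuming the word list is held in memory as a sorted Python list; the function name `prefix_matches` and the sample terms are hypothetical.

```python
from bisect import bisect_left


def prefix_matches(sorted_terms, prefix):
    """Return every term in the sorted word list that begins with `prefix`."""
    # Binary search for the first term that is >= the prefix.
    start = bisect_left(sorted_terms, prefix)
    matches = []
    # Matching terms are contiguous in lexicographic order, so scan until one fails.
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


terms = sorted(["company", "compute", "computer", "comma", "cone", "abacus"])
print(prefix_matches(terms, "comp"))  # ['company', 'compute', 'computer']
```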


Decisions in Building an Inverted File System: Query Languages

Some query options may require huge computation, e.g.,

Regular expressions

If inverted files are stored in lexicographic order,

comp* can be processed efficiently

*comp cannot be processed efficiently

Logical operators

If A and B are search terms

A or B can be processed by comparing two moderate sized lists

(not A) or (not B) requires two very large lists


Decisions in Building an Inverted File System: Storage and Performance

Storage

Inverted file systems are big, typically 10% to 100% the size of the collection of documents.

Update performance

It must be possible, with a reasonable amount of computation, to:

(a) Add a large batch of documents

(b) Add a single document

Retrieval performance

Retrieval must be fast enough to satisfy users and not use excessive resources.


Postings File

The postings file stores the elements of a sparse matrix, the components of the term vector space, with weights.

It is stored as a separate inverted list for each column, i.e., a list corresponding to each term in the index file.

Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document.

Each list consists of one or many individual postings.


Postings File: A Linked List for Each Term

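Each node in a list holds a (document, location) pair.

1 abacus -> (3, 94) -> (19, 7) -> (19, 212) -> (22, 56)

2 actor -> (2, 66) -> (19, 213) -> (29, 45)

3 aspen -> (5, 43)

4 atoll -> (11, 3) -> (11, 70) -> (34, 40)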

A linked list for each term is convenient to process sequentially, but slow to update when the lists are long.


Length of Postings File

A common term may have a very large number of postings.

Example:

1,000,000,000 documents

1,000,000 distinct words

average length 1,000 words per document

10^12 postings

By Zipf's law, the 10th ranking word occurs, approximately:

(10^12/10)/10 times = 10^10 times
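
The arithmetic can be checked directly: 10^9 documents of 1,000 words each give 10^12 postings, and the slide's estimate divides one tenth of that total by the rank. The sketch below simply restates that calculation; the function name `zipf_occurrences` is hypothetical, and the factor of 10 is the constant used on the slide.

```python
documents = 1_000_000_000
words_per_document = 1_000

# One posting per word occurrence.
total_postings = documents * words_per_document


def zipf_occurrences(total, rank):
    """Estimate from the slide: the word of rank r occurs about (total / 10) / r times."""
    return total // 10 // rank


print(total_postings == 10**12)              # True
print(zipf_occurrences(total_postings, 10))  # 10000000000
```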


Postings File

Merging inverted lists is the most computationally intensive task in many information retrieval systems.

Since inverted lists may be long, it is important to match postings efficiently.

Usually, the inverted lists will be held on disk and paged into memory for matching. Therefore algorithms for matching postings process the lists sequentially.

For efficient matching, the inverted lists should all be sorted in the same sequence.

Inverted lists are commonly cached to minimize disk accesses.


Word List

On disk

If a word list is held on disk, search time is dominated by the number of disk accesses.

In memory

Suppose that a word list has 1,000,000 distinct terms.

Each index entry consists of the term, some basic statistics and a pointer to the inverted list, average 100 characters.

Size of index is 100 megabytes, which can easily be held in memory of a dedicated computer.


File Structures for Inverted Files: Linear Index

Advantages

Can be searched quickly, e.g., by binary search, O(log n)

Good for lexicographic processing, e.g., comp*

Convenient for batch updating

Economical use of storage

Disadvantages

Index must be rebuilt if an extra term is added


File Structures for Inverted Files: Binary Tree

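[Figure: binary search tree built from the input below. The root is elk, with children bee (left) and hog (right). bee has children ant (left) and cat (right); cat has right child dog. hog has left child fox; fox has right child gnu.]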

Input: elk, hog, bee, fox, cat, gnu, ant, dog
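
Inserting the terms in the order given produces this tree. The sketch below is a plain, unbalanced binary search tree; the class and function names are hypothetical.

```python
class Node:
    def __init__(self, term):
        self.term = term
        self.left = None
        self.right = None


def insert(root, term):
    """Insert `term` into the binary search tree and return the (possibly new) root."""
    if root is None:
        return Node(term)
    if term < root.term:
        root.left = insert(root.left, term)
    elif term > root.term:
        root.right = insert(root.right, term)
    return root  # a term that is already present is left alone


def in_order(node):
    """Return the terms in lexicographic order."""
    if node is None:
        return []
    return in_order(node.left) + [node.term] + in_order(node.right)


root = None
for term in ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]:
    root = insert(root, term)

print(root.term, root.left.term, root.right.term)  # elk bee hog
print(in_order(root))
```

The shape of the tree depends entirely on the input order, which is why the tree tends to become unbalanced.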


File Structures for Inverted Files: Binary Tree

Advantages

Can be searched quickly

Convenient for batch updating

Easy to add an extra term

Economical use of storage

Disadvantages

Less good for lexicographic processing, e.g., comp*

Tree tends to become unbalanced

If the index is held on disk, important to optimize the number of disk accesses


File Structures for Inverted Files: Binary Tree

Calculation of maximum depth of tree.

Illustrates importance of balanced trees.

Worst case: depth = n

O(n)

Ideal case: depth = log(n + 1)/log 2

O(log n)
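
The two cases can be reproduced with a small experiment. The sketch below assumes distinct keys and counts depth as the number of levels in the tree; the function name `bst_depth` is hypothetical.

```python
import math


def bst_depth(keys):
    """Depth (number of levels) of the unbalanced BST built by inserting distinct keys in order."""
    children = {}  # key -> [left child, right child]
    root = None
    depth = 0
    for key in keys:
        level = 1
        if root is None:
            root = key
        else:
            node = root
            while True:
                side = 0 if key < node else 1
                child = children[node][side]
                level += 1
                if child is None:
                    children[node][side] = key
                    break
                node = child
        children[key] = [None, None]
        depth = max(depth, level)
    return depth


n = 7
worst = bst_depth(range(1, n + 1))        # sorted input: every node hangs off the right
ideal = bst_depth([4, 2, 6, 1, 3, 5, 7])  # an order that fills each level completely
print(worst, ideal, math.log2(n + 1))     # 7 3 3.0
```

Sorted input gives depth n, while a well-chosen insertion order reaches log(n + 1)/log 2.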


File Structures for Inverted Files: Right Threaded Binary Tree

Threaded tree:

A binary search tree in which each node uses an otherwise-empty left child link to refer to the node's in-order predecessor and an empty right child link to refer to its in-order successor.

Right-threaded tree:

A variant of a threaded tree in which only the right thread, i.e. link to the successor, of each node is maintained. Can be used for lexicographic processing.

A good data structure when held in memory

Knuth vol 1, 2.3.1, page 325.
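
A right-threaded tree can be traversed in lexicographic order without a stack or recursion. The following is a minimal sketch of one way to maintain the threads during insertion; it is not taken from Knuth's text, and the names `Node`, `is_thread`, `insert` and `in_order` are hypothetical.

```python
class Node:
    def __init__(self, term):
        self.term = term
        self.left = None
        self.right = None      # right child, or thread to the in-order successor
        self.is_thread = True  # True when `right` is a thread (None for the last node)


def insert(root, term):
    """Insert `term` into a right-threaded binary search tree and return the root."""
    new = Node(term)
    if root is None:
        return new
    node = root
    while True:
        if term < node.term:
            if node.left is None:
                node.left = new
                new.right = node        # successor of a new left child is its parent
                return root
            node = node.left
        elif term > node.term:
            if node.is_thread:
                new.right = node.right  # new node inherits the parent's old thread
                node.right = new
                node.is_thread = False
                return root
            node = node.right
        else:
            return root                 # term already present


def in_order(root):
    """Yield the terms in lexicographic order by following threads."""
    node = root
    while node is not None and node.left is not None:
        node = node.left
    while node is not None:
        yield node.term
        if node.is_thread:
            node = node.right           # jump straight to the successor
        else:
            node = node.right
            while node.left is not None:
                node = node.left        # leftmost node of the right subtree


root = None
for term in ["dog", "bee", "gnu", "ant", "cat", "elk", "hog", "fox"]:
    root = insert(root, term)
print(list(in_order(root)))
```

The thread of the last term in order is empty, which ends the traversal.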


File Structures for Inverted Files: Right Threaded Binary Tree

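[Figure: a right-threaded binary search tree on the terms ant, bee, cat, dog, elk, fox, gnu, hog, with dog at the root. Each node without a right child has a thread to its in-order successor; the thread from hog, the last term, is NULL.]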


File Structures for Inverted Files: B-trees

B-tree of order m:

A balanced, multiway search tree:

• Each node stores many keys

• Root has between 2 and 2m keys. All other internal nodes have between m and 2m keys.

• If ki is the ith key in a given internal node

-> all keys in the (i-1)th child are smaller than ki

-> all keys in the ith child are bigger than ki

• All leaves are at the same depth
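
The ordering property is what makes search possible: at each node, a binary search among the keys either finds the key or selects the one child that could contain it. The sketch below covers search only, not insertion or rebalancing, and uses illustrative keys; the names `BTreeNode` and `search` are hypothetical.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted keys held in this node
        self.children = children or []  # empty for a leaf, else len(keys) + 1 children


def search(node, key):
    """Return True if `key` is stored in the B-tree rooted at `node`."""
    while node is not None:
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:
            return False                # reached a leaf without finding the key
        # Child i holds the keys that fall between keys[i - 1] and keys[i].
        node = node.children[i]
    return False


# Illustrative two-level tree: a node with k keys has k + 1 children.
tree = BTreeNode([10, 19, 35], [
    BTreeNode([1, 5, 8, 9]),
    BTreeNode([12, 14, 18]),
    BTreeNode([21, 24, 28]),
    BTreeNode([36, 47]),
])

print(search(tree, 14), search(tree, 20))  # True False
```

Because all leaves are at the same depth, every search visits the same number of nodes, which matters when each node is a disk access.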


File Structures for Inverted Files: B-trees

B-tree example (order 2)

Root: 50 65

Internal nodes: 10 19 35 | 55 59 | 70 90 98

Leaves shown: 1 5 8 9 | 12 14 18 | 21 24 28 | 36 47 | 66 68 | 72 73 | 91 95 97

Every arrow points to a node containing between 2 and 4 keys. A node with k keys has k + 1 pointers.


File Structures for Inverted Files: B+-tree

• A B-tree is used as an index

• Data is stored in the leaves of the tree, known as buckets

Index root: 50 65

Index nodes: 10 25 | 55 59 | 70 81 90

Buckets (leaves): ... D9 D51 ... D54 D66 ... D81 ...

Example: B+-tree of order 2, bucket size 4

(Implementation of B+-trees is covered in CS 432.)