04/13/2005 Yan Huang - CSCI5330 Database Implementation – Information Retrieval Information Retrieval

Information Retrieval


Page 1: Information Retrieval


Information Retrieval

Page 2: Information Retrieval


Information Retrieval Systems

[Figure: IR system diagram with the labels "keyword query", "IR System", "Document", and "document"]

Page 3: Information Retrieval


Keyword Search

In full-text retrieval, all the words in each document are considered to be keywords. We use the word "term" to refer to the words in a document.

Ranking documents by their estimated relevance to a query is critical.

Page 4: Information Retrieval


Similarity Based Retrieval

Similarity-based retrieval: retrieve documents similar to a given document.

Similarity can be used to refine the answer set of a keyword query:
  • The user selects a few relevant documents from those retrieved by the keyword query, and the system finds other documents similar to these.

Page 5: Information Retrieval


Similarity Measures

A similarity measure is a function that computes the degree of similarity between two vectors.

Using a similarity measure between the query and each document:
  • It is possible to rank the retrieved documents in order of presumed relevance.
  • It is possible to enforce a threshold so that the size of the retrieved set can be controlled.
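A minimal Python sketch of the two uses above, assuming the similarity scores have already been computed. The function name `rank_with_threshold` and the example scores are illustrative and do not come from the slides.

```python
def rank_with_threshold(scores, threshold=0.0):
    """Keep documents scoring at least `threshold`, highest score first.

    `scores` maps a document identifier to its similarity with the query.
    """
    kept = [(doc_id, score) for doc_id, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical similarity scores for three documents.
example_scores = {"d1": 0.81, "d2": 0.13, "d3": 0.45}
ranked = rank_with_threshold(example_scores, threshold=0.4)
```

Raising the threshold shrinks the retrieved set; a threshold of 0.0 returns every document with a non-negative score, in ranked order.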

Page 6: Information Retrieval


Relevance Ranking

Relevance ranking is based on factors such as:
  • Term frequency: how often the query keyword occurs in the document.
  • Inverse document frequency: how many documents the query keyword occurs in. The fewer the documents, the more importance is given to the keyword.
  • Hyperlinks to documents: the more links point to a document, the more important the document is.

Page 7: Information Retrieval


Relevance Ranking Using Terms (Cont.)

Most systems add to the above model:
  • Words that occur in the title, author list, section headings, etc. are given greater importance.
  • Words whose first occurrence is late in the document are given lower importance.
  • Very common words such as “a”, “an”, “the”, “it”, etc. are eliminated. These are called stop words.
  • Proximity: if the query keywords occur close together in the document, the document has higher importance than if they occur far apart.
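Two of these refinements, stop-word removal and proximity, can be sketched in a few lines of Python. This is only an illustration: the stop-word set is just the four words named above, and the helper names `remove_stop_words` and `min_distance` are assumptions, not part of any particular system.

```python
# Stop words taken from the examples on the slide; real systems use longer lists.
STOP_WORDS = {"a", "an", "the", "it"}


def remove_stop_words(tokens):
    """Drop very common words before indexing or matching."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


def min_distance(tokens, first, second):
    """Smallest gap in token positions between two keywords, or None if one is absent."""
    first_positions = [i for i, token in enumerate(tokens) if token == first]
    second_positions = [i for i, token in enumerate(tokens) if token == second]
    if not first_positions or not second_positions:
        return None
    return min(abs(i - j) for i in first_positions for j in second_positions)


tokens = "the database stores an index for the retrieval engine".split()
filtered = remove_stop_words(tokens)
gap = min_distance(filtered, "index", "retrieval")
```

A ranking function could then give a document a higher score when `gap` is small; how much weight to give proximity is a design choice the slides leave open.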

Page 8: Information Retrieval


Vector Space Model

Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.

These “orthogonal” terms form a vector space: dimension = t = |vocabulary|.

Each term i in a document or query j is given a real-valued weight $w_{ij}$.

Both documents and queries are expressed as t-dimensional vectors: $d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$

Page 9: Information Retrieval


Term Weights

More frequent terms in a document are more important, i.e. more indicative of the topic:
$f_{ij}$ = frequency of term i in document j

We may want to normalize the term frequency (tf) by dividing by the frequency of the most common term in the document:
$tf_{ij} = f_{ij} / \max_i \{f_{ij}\}$
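The normalization above can be sketched in Python as follows, assuming a document is already tokenized into a list of terms. The function name `normalized_tf` and the sample sentence are illustrative.

```python
from collections import Counter


def normalized_tf(tokens):
    """Return tf_ij = f_ij / max_i f_ij for every term of one document."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}


# Hypothetical document: "information" occurs twice, every other term once.
doc = "information retrieval ranks documents for an information need".split()
tf = normalized_tf(doc)
```

The most common term always gets weight 1, so documents of different lengths become comparable.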

Page 10: Information Retrieval


Reverse Term Weights

Terms that appear in many different documents are less indicative of the overall topic.

$df_i$ = document frequency of term i = number of documents containing term i
$idf_i$ = inverse document frequency of term i = $\log_2 (N / df_i)$, where N is the total number of documents

idf is an indication of a term’s discrimination power. The log is used to dampen the effect relative to tf.
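A small Python sketch of the idf computation, assuming each document is a list of terms. The name `inverse_document_frequency` and the four-document corpus are made up for illustration.

```python
import math


def inverse_document_frequency(documents):
    """Return idf_i = log2(N / df_i) for every term in the collection."""
    n = len(documents)
    df = {}
    for doc in documents:
        for term in set(doc):  # count each term at most once per document
            df[term] = df.get(term, 0) + 1
    return {term: math.log2(n / count) for term, count in df.items()}


# Hypothetical corpus of N = 4 documents.
docs = [
    ["database", "index"],
    ["database", "query"],
    ["database", "retrieval", "index"],
    ["database", "ranking"],
]
idf = inverse_document_frequency(docs)
```

Here "database" occurs in all four documents and gets idf 0, "index" occurs in two and gets 1, and "query" occurs in one and gets 2, so rarer terms discriminate more.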

Page 11: Information Retrieval


TF-IDF Weighting

A typical combined term importance indicator is tf-idf weighting:
$w_{ij} = tf_{ij} \cdot idf_i = tf_{ij} \log_2 (N / df_i)$

A term occurring frequently in the document but rarely in the rest of the collection is given high weight.

Many other ways of determining term weights have been proposed.

Experimentally, tf-idf has been found to work well.
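As a rough sketch, tf-idf weights for a whole collection can be computed as below. It assumes tokenized documents and uses the max-normalized tf and the base-2 idf defined in these slides; the name `tf_idf_weights` and the toy corpus are illustrative.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Return one {term: w_ij} dict per document, with w_ij = tf_ij * log2(N / df_i)."""
    n = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        max_count = max(counts.values()) if counts else 1
        weights.append({
            term: (count / max_count) * math.log2(n / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical corpus: "data" is in every document, "query" in only one.
docs = [
    ["data", "data", "index"],
    ["data", "query"],
    ["data", "index"],
    ["data", "ranking"],
]
weights = tf_idf_weights(docs)
```

In this toy corpus "data" gets weight 0 everywhere because it appears in every document, while "query" gets weight 2 in the one document that contains it.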

Page 12: Information Retrieval


Inner Product Measure

Similarity between the vectors for document $d_j$ and query q can be computed as the vector inner product:

$sim(d_j, q) = d_j \cdot q = \sum_{i=1}^{t} w_{ij} \, w_{iq}$

where $w_{ij}$ is the weight of term i in document j and $w_{iq}$ is the weight of term i in the query.

For binary vectors, the inner product is the number of matched query terms in the document (size of intersection).

For weighted term vectors, it is the sum of the products of the weights of the matched terms.

Page 13: Information Retrieval


Inner Product -- Examples

Binary:
  D = 1, 1, 1, 0, 1, 1, 0
  Q = 1, 0, 1, 0, 0, 1, 1
  sim(D, Q) = 3

  Term labels in the figure: retrieval, database, architecture, computer, text, management, information
  Size of vector = size of vocabulary = 7
  0 means the corresponding term is not found in the document or query

Weighted:
  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + 1T3
  Q = 0T1 + 0T2 + 2T3

  sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
  sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2

Problems?
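The examples above can be checked with a short Python sketch. The function name `inner_product` is an assumption; the vectors are the ones from the slide. The last line illustrates one commonly noted limitation of the raw inner product: the score is unbounded and grows when a document vector is simply scaled up, so longer documents tend to score higher.

```python
def inner_product(d, q):
    """Sum of products of matching term weights in two equal-length vectors."""
    if len(d) != len(q):
        raise ValueError("document and query vectors must have the same dimension")
    return sum(w_d * w_q for w_d, w_q in zip(d, q))


# Binary example from the slide.
binary_d = [1, 1, 1, 0, 1, 1, 0]
binary_q = [1, 0, 1, 0, 0, 1, 1]

# Weighted example from the slide.
d1 = [2, 3, 5]
d2 = [3, 7, 1]
q = [0, 0, 2]

# Doubling every weight of D1 doubles its score without changing its direction.
doubled_score = inner_product([2 * w for w in d1], q)
```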

Page 14: Information Retrieval


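Cosine Similarity Measure

Cosine similarity measures the cosine of the angle between two vectors: the inner product normalized by the vector lengths.

$CosSim(d_j, q) = \frac{d_j \cdot q}{|d_j| \, |q|} = \frac{\sum_{i=1}^{t} w_{ij} \, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2 \cdot \sum_{i=1}^{t} w_{iq}^2}}$

[Figure: vectors D1, D2, and Q in the space spanned by the axes t1, t2, t3]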
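A minimal Python sketch of cosine similarity, reusing the weighted vectors D1, D2, and Q from the inner product examples. The function name `cosine_similarity` and the convention of returning 0.0 for a zero vector are assumptions made for this illustration.

```python
import math


def cosine_similarity(d, q):
    """Inner product of d and q divided by the product of their lengths."""
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_d * norm_q)


d1 = [2, 3, 5]
d2 = [3, 7, 1]
q = [0, 0, 2]

cos_d1 = cosine_similarity(d1, q)  # about 0.81
cos_d2 = cosine_similarity(d2, q)  # about 0.13
```

Unlike the raw inner product, the result does not change when a document vector is scaled, so document length alone does not raise the score.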

Page 15: Information Retrieval


Relevance Using Hyperlinks

Problem with keyword search? Problem with most-frequently-visited website search?

Idea: use the popularity of a Web site (e.g. how many people visit it) to rank the site's pages that match the given keywords.

Problem: it is hard to find the actual popularity of a site.

Page 16: Information Retrieval


Different Ranking Factors

Keyword and anchor-text based search finds all the related pages first.

PageRank then ranks the search result set.
  • A highly ranked page is not interesting to you at all if it is not related.

Page 17: Information Retrieval


Link Counts

[Figure: Sep’s Home Page and Taher’s Home Page, one linked by 2 important pages and the other linked by 2 unimportant pages; the linking pages shown are Yahoo!, CNN, DB Pub Server, and CS361]

Page 18: Information Retrieval


Definition of PageRank

$x_i = \sum_{j \in B_i} \frac{1}{N_j} x_j$

Let us calculate.

Page 19: Information Retrieval


Definition of PageRank

[Figure: link graph over Yahoo!, CNN, DB Pub Server, Taher, and Sep, annotated with link weights 1/2, 1/2, 1, 1 and values 0.1, 0.1, 0.1, 0.05, 0.25]

Page 20: Information Retrieval


PageRank Diagram

Initialize all nodes to rank $x_i^{(0)} = \frac{1}{n}$

[Figure: three-node link graph; node ranks 0.333, 0.333, 0.333]

Page 21: Information Retrieval


PageRank Diagram

Propagate ranks across links (multiplying by link weights)

[Figure: values propagated along the links: 0.167, 0.167, 0.333, 0.333]

Page 22: Information Retrieval


PageRank Diagram

$x_i^{(1)} = \sum_{j \in B_i} \frac{1}{N_j} x_j^{(0)}$

[Figure: node ranks after the first iteration: 0.333, 0.5, 0.167]

Page 23: Information Retrieval


PageRank Diagram

[Figure: values propagated along the links: 0.167, 0.167, 0.5, 0.167]

Page 24: Information Retrieval


PageRank Diagram

$x_i^{(2)} = \sum_{j \in B_i} \frac{1}{N_j} x_j^{(1)}$

[Figure: node ranks after the second iteration: 0.5, 0.333, 0.167]

Page 25: Information Retrieval


PageRank Diagram

After a while…

$x_i = \sum_{j \in B_i} \frac{1}{N_j} x_j$

[Figure: node ranks at convergence: 0.4, 0.4, 0.2]

Page 26: Information Retrieval


Computing PageRank

Initialize:

$x_i^{(0)} = \frac{1}{n}$

Repeat until convergence:

$x_i^{(k+1)} = \sum_{j \in B_i} \frac{1}{N_j} x_j^{(k)}$

where
  $x_i$ = importance of page i
  $B_i$ = pages j that link to page i
  $N_j$ = number of outlinks from page j
  $x_j$ = importance of page j
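The iteration above can be sketched in Python as follows. The three-page graph is an assumption: the diagram itself did not survive extraction, so the links A→B, A→C, B→A, C→B were chosen because they reproduce the rank values quoted on the PageRank Diagram slides (0.333/0.5/0.167 after one step, 0.4/0.4/0.2 at convergence). The function names and the convergence tolerance are illustrative, and the sketch assumes every page has at least one outlink and appears as a key.

```python
def pagerank_step(rank, out_links):
    """One update: x_i(k+1) = sum over pages j linking to i of x_j(k) / N_j."""
    new_rank = {page: 0.0 for page in rank}
    for j, targets in out_links.items():
        share = rank[j] / len(targets)  # page j splits its rank over its N_j outlinks
        for i in targets:
            new_rank[i] += share
    return new_rank


def pagerank_simple(out_links, tol=1e-10, max_iter=1000):
    """Start from 1/n and repeat the update until the ranks stop changing."""
    n = len(out_links)
    rank = {page: 1.0 / n for page in out_links}
    for _ in range(max_iter):
        new_rank = pagerank_step(rank, out_links)
        change = sum(abs(new_rank[p] - rank[p]) for p in rank)
        rank = new_rank
        if change < tol:
            break
    return rank


# Assumed graph, chosen to be consistent with the numbers on the slides.
links = {"A": ["B", "C"], "B": ["A"], "C": ["B"]}
first = pagerank_step({page: 1 / 3 for page in links}, links)
final = pagerank_simple(links)
```

Without damping, convergence depends on the structure of the link graph; this small graph converges, but graphs with dead ends or isolated cycles may not.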

Page 27: Information Retrieval


Definition of PageRank

The importance of a page is given by the importance of the pages that link to it.
d is a damping factor, usually 0.85.

$x_i = (1 - d) + d \sum_{j \in B_i} \frac{1}{N_j} x_j$

where
  $x_i$ = importance of page i
  $B_i$ = pages j that link to page i
  $N_j$ = number of outlinks from page j
  $x_j$ = importance of page j
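A hedged Python sketch of the damped definition, solved by the same kind of fixed-point iteration. The function name `pagerank_damped`, the tolerance, and the three-page graph are assumptions for illustration; every page is assumed to have at least one outlink.

```python
def pagerank_damped(out_links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate x_i = (1 - d) + d * sum over pages j linking to i of x_j / N_j."""
    n = len(out_links)
    rank = {page: 1.0 / n for page in out_links}
    for _ in range(max_iter):
        new_rank = {page: 1.0 - d for page in out_links}
        for j, targets in out_links.items():
            share = d * rank[j] / len(targets)  # damped share sent along each outlink
            for i in targets:
                new_rank[i] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in rank)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical graph: A links to B and C, B links to A, C links to B.
links = {"A": ["B", "C"], "B": ["A"], "C": ["B"]}
ranks = pagerank_damped(links)
```

With this form of the equation the ranks sum to the number of pages rather than to 1. Because d < 1, the iteration converges regardless of the starting values.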

Page 28: Information Retrieval


Synonyms and Homonyms

Synonyms
  • E.g. document: “motorcycle repair”, query: “motorcycle maintenance”. The system needs to realize that “maintenance” and “repair” are synonyms.
  • The system can extend the query as “motorcycle and (repair or maintenance)”.

Homonyms
  • E.g. “object” has different meanings as a noun and as a verb.
  • The system can disambiguate meanings (to some extent) from the context.
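The synonym-based query extension can be sketched in Python as below. The synonym table is a hand-written stand-in for a real thesaurus, and the names `SYNONYMS`, `expand_query`, and `matches` are hypothetical.

```python
# Hypothetical synonym table; a real system would use a thesaurus.
SYNONYMS = {"repair": {"maintenance"}, "maintenance": {"repair"}}


def expand_query(terms, synonyms=SYNONYMS):
    """Turn each query term into an OR-group of the term plus its synonyms."""
    return [sorted({term} | synonyms.get(term, set())) for term in terms]


def matches(document_terms, expanded_query):
    """A document matches if it contains at least one alternative from every group (AND of ORs)."""
    present = set(document_terms)
    return all(any(alt in present for alt in group) for group in expanded_query)


expanded = expand_query(["motorcycle", "maintenance"])
hit = matches(["motorcycle", "repair"], expanded)
```

Here `expanded` corresponds to “motorcycle and (maintenance or repair)”, so the document “motorcycle repair” now matches the query “motorcycle maintenance”.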

Page 29: Information Retrieval


Indexing of Documents

An inverted index maps each keyword $K_i$ to the set $S_i$ of documents that contain the keyword.
  • Documents are identified by identifiers.
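A minimal Python sketch of an inverted index, assuming whitespace tokenization and integer document identifiers. The names `build_inverted_index` and `search_all` and the sample documents are illustrative.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each keyword to the set of identifiers of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index, keywords):
    """Identifiers of documents that contain every keyword (set intersection)."""
    postings = [index.get(keyword.lower(), set()) for keyword in keywords]
    return set.intersection(*postings) if postings else set()


# Hypothetical documents keyed by identifier.
documents = {
    1: "motorcycle repair manual",
    2: "motorcycle maintenance guide",
    3: "database index",
}
index = build_inverted_index(documents)
```

A multi-keyword query is answered by intersecting the document sets of its keywords, so the documents themselves never need to be scanned at query time.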

Page 30: Information Retrieval


Measuring Retrieval Effectiveness

Relevant performance metrics:
  • Precision: what percentage of the retrieved documents are relevant to the query.
  • Recall: what percentage of the documents relevant to the query were retrieved.

Page 31: Information Retrieval


Precision and Recall

Precision: a/(a+c)
  • Among all the retrieved documents, how many are actually positive?
Recall: a/(a+b)
  • Percentage of the actual positive data that is retrieved
F measure: 2pr/(r+p), where p is precision and r is recall

                      predict
                      true    false
  actual   positive   a       b
           negative   c       d
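The three measures can be computed from sets of document identifiers, as in this Python sketch. The variable names a, b, c follow the table; the function name `precision_recall_f` and the example sets are assumptions.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F measure) for a retrieved set and a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)  # retrieved and actually relevant
    b = len(relevant - retrieved)  # relevant but not retrieved
    c = len(retrieved - relevant)  # retrieved but not relevant
    precision = a / (a + c) if a + c else 0.0
    recall = a / (a + b) if a + b else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical query result: 4 documents retrieved, 6 relevant, 2 in common.
p, r, f = precision_recall_f(retrieved={1, 2, 3, 4}, relevant={2, 4, 5, 6, 7, 8})
```

In this example precision is 2/4, recall is 2/6, and the F measure is 0.4, lower than the arithmetic mean because the harmonic mean penalizes the weaker of the two.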

Page 32: Information Retrieval


Training Data

Problem: which documents are actually relevant, and which are not?
  • Usual solution: human judges.
  • Create a corpus of documents and queries, with humans deciding which documents are relevant to which queries.
  • TREC (Text REtrieval Conference) benchmark.

Page 33: Information Retrieval


Web Crawling

Web crawlers are programs that locate and gather information on the Web.
  • They recursively follow hyperlinks present in known documents to find other documents.
  • They start from a seed set of documents.

Fetched documents:
  • Are handed over to an indexing system.
  • Can be discarded after indexing, or stored as a cached copy.
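The crawl loop can be sketched in Python without touching the network by passing in a function that returns the links of a page. Everything named here (`crawl`, `fetch_links`, `max_pages`, and the toy link table) is hypothetical, and the sketch uses an explicit queue rather than literal recursion.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages breadth-first from the seed set, following each link once."""
    frontier = deque(dict.fromkeys(seeds))  # seed set, duplicates removed
    seen = set(frontier)
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        fetched.append(url)  # here the page would be handed to the indexing system
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched


# Toy in-memory "Web": page name -> hyperlinks found on that page.
toy_web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
order = crawl(["a"], lambda url: toy_web.get(url, []))
```

The `seen` set keeps the crawler from fetching a page twice even when the link graph contains cycles, and `max_pages` bounds the crawl.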

Page 34: Information Retrieval


Browsing

Storing related documents together in a library facilitates browsing:
  • Users can see not only the requested document but also related ones.

Browsing is facilitated by a classification system that organizes logically related documents together.

Organization is hierarchical: a classification hierarchy.

Page 35: Information Retrieval


A Classification Hierarchy For A Library System

Page 36: Information Retrieval


Classification DAG

Documents can reside in multiple places in a hierarchy in an information retrieval system, since physical location is not important.

The classification hierarchy is thus a Directed Acyclic Graph (DAG).
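A small Python sketch of a classification DAG, where a category may have more than one parent. The category names are invented for illustration (the slide's own figure is not available), and `ancestors` is a hypothetical helper.

```python
# Hypothetical categories: "graph algorithms" has two parents, so this is a DAG, not a tree.
PARENTS = {
    "graph algorithms": ["algorithms", "discrete math"],
    "algorithms": ["computer science"],
    "discrete math": ["math"],
    "computer science": ["science"],
    "math": ["science"],
    "science": [],
}


def ancestors(category, parents=PARENTS):
    """All categories under which `category` can be reached while browsing."""
    found = set()
    stack = list(parents.get(category, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


reachable_from = ancestors("graph algorithms")
```

A document filed under "graph algorithms" is then found by browsing down from either "computer science" or "math".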

Page 37: Information Retrieval


A Classification DAG For A Library Information Retrieval System

Page 38: Information Retrieval


Web Directories

A Web directory is just a classification directory on Web pages.
  • E.g. Yahoo! Directory, Open Directory project.
  • Issues:
      • What should the directory hierarchy be?
      • Given a document, which nodes of the directory are categories relevant to the document?
  • Often done manually.
  • Classification of documents into a hierarchy may be done based on term similarity.

Page 39: Information Retrieval


Some slides of this slide set are adapted from the following slides:
  • Prof. James Allan’s course slides
  • Extrapolation Methods for Accelerating PageRank Computations by Sepandar D. Kamvar et al.