Information Retrieval

04/13/2005 Yan Huang - CSCI5330 Database Implementation – Information Retrieval



Information Retrieval Systems

[Diagram: documents and a keyword query connected to the IR system]


Keyword Search

In full-text retrieval, all the words in each document are considered to be keywords. We use the word term to refer to the words in a document.

Ranking of documents on the basis of estimated relevance to a query is critical.


Similarity Based Retrieval

Similarity-based retrieval: retrieve documents similar to a given document.

Similarity can be used to refine the answer set of a keyword query. The user selects a few relevant documents from those retrieved by the keyword query, and the system finds other documents similar to these.


Similarity Measures

A similarity measure is a function that computes the degree of similarity between two vectors.

Using a similarity measure between the query and each document:

It is possible to rank the retrieved documents in the order of presumed relevance.

It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.


Relevance Ranking

Relevance ranking is based on factors such as:

Term frequency: frequency of occurrence of the query keyword in the document.

Inverse document frequency: how many documents the query keyword occurs in. The fewer the documents, the more importance is given to the keyword.

Hyperlinks to documents: the more links there are to a document, the more important the document is.


Relevance Ranking Using Terms (Cont.)

Most systems add to the above model:

Words that occur in the title, author list, section headings, etc. are given greater importance.

Words whose first occurrence is late in the document are given lower importance.

Very common words such as “a”, “an”, “the”, “it”, etc. are eliminated. These are called stop words.

Proximity: if the keywords in a query occur close together in the document, the document has higher importance than if they occur far apart.


Vector Space Model

Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.

These “orthogonal” terms form a vector space. Dimension = t = |vocabulary|

Each term i in a document or query j is given a real-valued weight $w_{ij}$.

Both documents and queries are expressed as t-dimensional vectors: $d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$


Term Weights

More frequent terms in a document are more important, i.e. more indicative of the topic.

$f_{ij}$ = frequency of term i in document j

May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:

$tf_{ij} = f_{ij} / \max_i \{f_{ij}\}$


Reverse Term Weights

Terms that appear in many different documents are less indicative of overall topic.

$df_i$ = document frequency of term i = number of documents containing term i

$idf_i$ = inverse document frequency of term i = $\log_2 (N / df_i)$ (N: total number of documents)

An indication of a term’s discrimination power. Log used to dampen the effect relative to tf.


TF-IDF Weighting

A typical combined term importance indicator is tf-idf weighting:

$w_{ij} = tf_{ij} \cdot idf_i = tf_{ij} \log_2 (N / df_i)$

A term occurring frequently in the document but rarely in the rest of the collection is given high weight.

Many other ways of determining term weights have been proposed.

Experimentally, tf-idf has been found to work well.
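The weighting above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `tf_idf` and the toy collection are assumptions made for the example, documents are assumed to be non-empty lists of already tokenized terms, and tf is normalized by the most common term in each document as described earlier.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with w_ij = tf_ij * log2(N / df_i)."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]            # f_ij: raw term counts per document
    df = Counter(term for c in counts for term in c)   # df_i: number of documents containing term i
    weights = []
    for c in counts:
        max_f = max(c.values())                        # frequency of the most common term in this document
        weights.append({
            term: (f / max_f) * math.log2(n_docs / df[term])
            for term, f in c.items()
        })
    return weights

# Hypothetical toy collection, already tokenized and with stop words removed.
docs = [
    ["database", "retrieval", "database", "index"],
    ["retrieval", "ranking"],
    ["database", "architecture"],
    ["computer", "architecture"],
]
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "index" occurs only once but appears in no other document, so it ends up with the same weight as "database", which occurs twice but is found in half of the collection.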


Inner Product Measure

Similarity between the vectors for document dj and query q can be computed as the vector inner product:

$$\mathrm{sim}(d_j, q) = d_j \cdot q = \sum_{i=1}^{t} w_{ij} \, w_{iq}$$

where wij is the weight of term i in document j and wiq is the weight of term i in the query.

For binary vectors, the inner product is the number of matched query terms in the document (size of intersection).

For weighted term vectors, it is the sum of the products of the weights of the matched terms.


Inner Product -- Examples

Binary (vocabulary terms: retrieval, database, architecture, computer, text, management, information):

D = 1, 1, 1, 0, 1, 1, 0
Q = 1, 0, 1, 0, 0, 1, 1

sim(D, Q) = 3

Size of vector = size of vocabulary = 7. A 0 means the corresponding term is not found in the document or query.

Weighted:

D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3

sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2

Problems?
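The binary and weighted examples above can be checked with a short sketch. The helper name `inner_product` is an assumption for illustration; the vectors are the ones from the examples.

```python
def inner_product(d, q):
    """Sum of the products of corresponding term weights."""
    if len(d) != len(q):
        raise ValueError("vectors must have the same dimension")
    return sum(wd * wq for wd, wq in zip(d, q))

# Binary example: the result is the number of matched terms.
binary_d = [1, 1, 1, 0, 1, 1, 0]
binary_q = [1, 0, 1, 0, 0, 1, 1]
print(inner_product(binary_d, binary_q))            # prints 3

# Weighted example over terms T1, T2, T3.
d1, d2, q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(inner_product(d1, q), inner_product(d2, q))   # prints 10 2
```

Note that the score is unbounded: scaling every weight in a document by a constant scales its score by the same constant.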


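Cosine Similarity Measure

Cosine similarity measures the cosine of the angle between two vectors: the inner product normalized by the vector lengths.

$$\mathrm{CosSim}(d_j, q) = \frac{d_j \cdot q}{|d_j| \, |q|} = \frac{\sum_{i=1}^{t} w_{ij} \, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2 \cdot \sum_{i=1}^{t} w_{iq}^2}}$$

[Figure: vectors D1, D2, and Q plotted on axes t1, t2, t3]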
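As a rough illustration, the cosine measure can be applied to the weighted vectors D1, D2, and Q from the inner product examples. The function name `cos_sim` and the choice to return 0.0 for an all-zero vector are assumptions of this sketch.

```python
import math

def cos_sim(d, q):
    """Inner product of d and q divided by the product of their lengths."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an all-zero vector is treated as similar to nothing
    return dot / (norm_d * norm_q)

d1, d2, q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(cos_sim(d1, q), 2), round(cos_sim(d2, q), 2))  # prints 0.81 0.13
```

Under the inner product D1 scores 5 times higher than D2 (10 versus 2); under the cosine measure the ratio is about 6, because D2 is the longer vector and is penalized for it.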


Relevance Using Hyperlinks

Problem with keyword search?

Problem with search based on the most frequently visited Web sites?

Idea: use popularity of Web site (e.g. how many people visit it) to rank site pages that match given keywords

Problem: hard to find actual popularity of site


Different Ranking Factors

Keyword and anchor-text based search finds all the related pages first.

PageRank then ranks the search result set.

A highly ranked page is not interesting to you at all if it is not related.


Link Counts

[Figure: two home pages, Sep’s and Taher’s; one is linked by 2 important pages and the other by 2 unimportant pages. Linking pages shown: Yahoo!, CNN, DB Pub Server, CS361]


Definition of PageRank

$$x_i = \sum_{j \in B_i} \frac{1}{N_j} x_j$$

Let us calculate.


Definition of PageRank

[Figure: example link graph with nodes Yahoo!, CNN, DB Pub Server, Taher, and Sep; link weights 1/2, 1/2, 1, 1; values shown: 0.1, 0.1, 0.1, 0.05, 0.25]


PageRank Diagram

Initialize all nodes to rank $x_i^{(0)} = \frac{1}{n}$

[Figure: three-node graph, each node with rank 0.333]


PageRank Diagram

Propagate ranks across links (multiplying by link weights)

[Figure: three-node graph; propagated values 0.167, 0.167, 0.333, 0.333]


PageRank Diagram

$$x_i^{(1)} = \sum_{j \in B_i} \frac{1}{N_j} x_j^{(0)}$$

[Figure: three-node graph with ranks 0.333, 0.5, 0.167]


PageRank Diagram

[Figure: three-node graph; propagated values 0.167, 0.167, 0.5, 0.167]


PageRank Diagram

$$x_i^{(2)} = \sum_{j \in B_i} \frac{1}{N_j} x_j^{(1)}$$

[Figure: three-node graph with ranks 0.5, 0.333, 0.167]


PageRank Diagram

After a while…

$$x_i = \sum_{j \in B_i} \frac{1}{N_j} x_j$$

[Figure: three-node graph with ranks 0.4, 0.4, 0.2]


Computing PageRank

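Initialize:

$$x_i^{(0)} = \frac{1}{n}$$

Repeat until convergence:

$$x_i^{(k+1)} = \sum_{j \in B_i} \frac{1}{N_j} x_j^{(k)}$$

where $x_i$ is the importance of page i, $B_i$ is the set of pages j that link to page i, $N_j$ is the number of outlinks from page j, and $x_j$ is the importance of page j.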
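The initialize-and-repeat procedure can be sketched as a simple power iteration. Several things here are assumptions of the example rather than part of the slides: the function name `pagerank`, the convergence tolerance, the requirement that every page has at least one outlink, and the three-page graph itself, which is a hypothetical graph chosen because it is consistent with the rank values shown in the PageRank Diagram slides (0.333 each, then 0.333/0.5/0.167, and 0.4/0.4/0.2 after convergence).

```python
def pagerank(out_links, tol=1e-10, max_iter=1000):
    """Power iteration for x_i = sum over j in B_i of x_j / N_j (no damping).

    out_links maps each page to the list of pages it links to.
    Assumes every page has at least one outlink.
    """
    pages = list(out_links)
    n = len(pages)
    x = {p: 1.0 / n for p in pages}            # x_i^(0) = 1/n
    for _ in range(max_iter):
        nxt = {p: 0.0 for p in pages}
        for j, targets in out_links.items():
            share = x[j] / len(targets)        # x_j / N_j, sent along each outlink of j
            for i in targets:
                nxt[i] += share
        if max(abs(nxt[p] - x[p]) for p in pages) < tol:
            return nxt
        x = nxt
    return x

# Hypothetical graph: A links to B and C, B links to A, C links to B.
graph = {"A": ["B", "C"], "B": ["A"], "C": ["B"]}
first_step = pagerank(graph, max_iter=1)
ranks = pagerank(graph)
print({p: round(v, 3) for p, v in first_step.items()})
print({p: round(v, 3) for p, v in ranks.items()})
```

The loop pushes each page's rank forward along its outlinks, which computes the same sum as pulling from the in-link set B_i but only needs the outlink lists.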


Definition of PageRank

The importance of a page is given by the importance of the pages that link to it.

d is a damping factor, usually 0.85.

$$x_i = (1 - d) + d \sum_{j \in B_i} \frac{1}{N_j} x_j$$

where $x_i$ is the importance of page i, $B_i$ is the set of pages j that link to page i, $N_j$ is the number of outlinks from page j, and $x_j$ is the importance of page j.
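A minimal sketch of the damped iteration follows, assuming the update x_i = (1 - d) + d * sum over j in B_i of x_j / N_j. The function name `damped_pagerank`, the starting value of 1.0 per page, the tolerance, and the three-page graph are all assumptions made for illustration, and every page is assumed to have at least one outlink. Some formulations divide the (1 - d) term by the number of pages so that the ranks sum to 1; with the form used here they sum to the number of pages.

```python
def damped_pagerank(out_links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate x_i = (1 - d) + d * sum over j in B_i of x_j / N_j.

    out_links maps each page to the list of pages it links to.
    Assumes every page has at least one outlink.
    """
    pages = list(out_links)
    x = {p: 1.0 for p in pages}                    # assumed starting value
    for _ in range(max_iter):
        nxt = {p: 1.0 - d for p in pages}          # the (1 - d) term
        for j, targets in out_links.items():
            share = d * x[j] / len(targets)        # d * x_j / N_j per outlink
            for i in targets:
                nxt[i] += share
        if max(abs(nxt[p] - x[p]) for p in pages) < tol:
            return nxt
        x = nxt
    return x

# Hypothetical graph: A links to B and C, B links to A, C links to B.
graph = {"A": ["B", "C"], "B": ["A"], "C": ["B"]}
damped = damped_pagerank(graph)
print({p: round(v, 3) for p, v in damped.items()})
```

With d = 1 the update reduces to the undamped definition; smaller values of d pull every page's rank toward the same baseline.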


Synonyms and Homonyms

Synonyms

E.g. document: “motorcycle repair”, query: “motorcycle maintenance”. The system needs to realize that “maintenance” and “repair” are synonyms.

The system can extend the query as “motorcycle and (repair or maintenance)”.

Homonyms

E.g. “object” has different meanings as a noun and as a verb.

Meanings can be disambiguated (to some extent) from the context.


Indexing of Documents

An inverted index maps each keyword Ki to a set of documents Si that contain the keyword.

Documents are identified by identifiers.
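An inverted index of this kind can be sketched with a dictionary of sets. The function name `build_inverted_index`, the whitespace tokenization, and the three toy documents are assumptions for the example; the query is the expanded “motorcycle and (repair or maintenance)” query from the synonyms discussion, expressed as set operations.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword K_i to the set S_i of identifiers of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():      # naive tokenization, for illustration only
            index[term].add(doc_id)
    return index

# Hypothetical documents keyed by document identifier.
docs = {
    1: "motorcycle repair",
    2: "motorcycle maintenance",
    3: "bicycle repair",
}
index = build_inverted_index(docs)

# "motorcycle and (repair or maintenance)": AND is set intersection, OR is set union.
hits = index["motorcycle"] & (index["repair"] | index["maintenance"])
print(sorted(hits))  # prints [1, 2]
```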


Measuring Retrieval Effectiveness

Relevant performance metrics:

Precision: what percentage of the retrieved documents are relevant to the query.

Recall: what percentage of the documents relevant to the query were retrieved.


Precision and Recall

Precision: a/(a+c)
Among all the retrieved documents, how many are actually positive?

Recall: a/(a+b)
Percentage of actual positive data retrieved.

F measure: 2pr/(r+p)

                   predict
                   true   false
actual  positive   a      b
        negative   c      d
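The three measures can be computed directly from the set of retrieved documents and the set of relevant documents. In this sketch the function name `precision_recall_f` and the document identifiers are hypothetical, and returning 0.0 when a denominator is zero is an assumption of the example.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F measure) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)   # relevant and retrieved
    c = len(retrieved - relevant)   # retrieved but not relevant
    b = len(relevant - retrieved)   # relevant but not retrieved
    p = a / (a + c) if a + c else 0.0
    r = a / (a + b) if a + b else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical query: 4 documents retrieved, 6 documents actually relevant, 2 in common.
p, r, f = precision_recall_f(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7, 8])
print(round(p, 3), round(r, 3), round(f, 3))  # prints 0.5 0.333 0.4
```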


Training Data

Problem: which documents are actually relevant, and which are not?

Usual solution: human judges. Create a corpus of documents and queries, with humans deciding which documents are relevant to which queries.

TREC (Text REtrieval Conference) Benchmark


Web Crawling

Web crawlers are programs that locate and gather information on the Web.

They recursively follow hyperlinks present in known documents to find other documents, starting from a seed set of documents.

Fetched documents are handed over to an indexing system. They can be discarded after indexing, or stored as a cached copy.
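The link-following step can be sketched without any network access by crawling a small in-memory graph. This is an iterative, breadth-first variant of the recursive description above; the function name `crawl`, the `fetch_links` callback, the `max_pages` limit, and the toy web are all hypothetical names introduced for the example.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl from a seed set, visiting each page at most once.

    fetch_links(url) is assumed to return the hyperlinks found on that page.
    """
    seen = set(seeds)
    queue = deque(seeds)
    fetched = []
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        fetched.append(url)          # a real crawler would hand the page to the indexer here
        for link in fetch_links(url):
            if link not in seen:     # skip pages that are already known
                seen.add(link)
                queue.append(link)
    return fetched

# Hypothetical toy web: page name -> pages it links to.
toy_web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": [], "e": ["a"]}
order = crawl(["a"], lambda url: toy_web.get(url, []))
print(order)  # prints ['a', 'b', 'c', 'd']
```

Page "e" is never fetched because no page reachable from the seed links to it, which shows why the choice of seed set matters.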


Browsing

Storing related documents together in a library facilitates browsing: users can see not only the requested document but also related ones.

Browsing is facilitated by a classification system that organizes logically related documents together.

Organization is hierarchical: a classification hierarchy.


A Classification Hierarchy For A Library System


Classification DAG

Documents can reside in multiple places in a hierarchy in an information retrieval system, since physical location is not important.

The classification hierarchy is thus a directed acyclic graph (DAG).


A Classification DAG For A Library Information Retrieval System


Web Directories

A Web directory is just a classification directory on Web pages, e.g. Yahoo! Directory, Open Directory project.

Issues:

What should the directory hierarchy be?

Given a document, which nodes of the directory are categories relevant to the document? This is often done manually.

Classification of documents into a hierarchy may be done based on term similarity.


Some slides of this slide set are adapted from the following slides:

Prof. James Allan’s course slides

Extrapolation Methods for Accelerating PageRank Computations by Sepandar D. Kamvar et al.