04/13/2005 Yan Huang - CSCI5330 Database Implementation – Information Retrieval
Information Retrieval
Information Retrieval Systems
[Figure: IR system diagram with the labels "Document", "IR System", "key word query", and "document".]
Keyword Search
In full text retrieval, all the words in each document are considered to be keywords. We use the word “term” to refer to the words in a document.
Ranking of documents on the basis of estimated relevance to a query is critical.
Similarity Based Retrieval
Similarity-based retrieval: retrieve documents similar to a given document.
Similarity can be used to refine the answer set of a keyword query.
  The user selects a few relevant documents from those retrieved by the keyword query, and the system finds other documents similar to these.
Similarity Measures
A similarity measure is a function that computes the degree of similarity between two vectors.
Using a similarity measure between the query and each document:
  It is possible to rank the retrieved documents in the order of presumed relevance.
  It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.
Relevance Ranking
Relevance ranking is based on factors such as:
Term frequency
  Frequency of occurrence of the query keyword in the document
Inverse document frequency
  How many documents the query keyword occurs in
  Fewer documents give more importance to the keyword
Hyperlinks to documents
  More links to a document means the document is more important
Relevance Ranking Using Terms (Cont.)
Most systems add to the above model:
Words that occur in the title, author list, section headings, etc. are given greater importance.
Words whose first occurrence is late in the document are given lower importance.
Very common words such as “a”, “an”, “the”, “it”, etc. are eliminated.
  These are called stop words.
Proximity: if keywords in the query occur close together in the document, the document has higher importance than if they occur far apart.
Vector Space Model
Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.
These “orthogonal” terms form a vector space. Dimension = t = |vocabulary|.
Each term i in a document or query j is given a real-valued weight $w_{ij}$.
Both documents and queries are expressed as t-dimensional vectors: $d_j = (w_{1j}, w_{2j}, \dots, w_{tj})$
Term Weights
More frequent terms in a document are more important, i.e. more indicative of the topic.
  $f_{ij}$ = frequency of term i in document j
May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:
  $tf_{ij} = f_{ij} / \max_i \{f_{ij}\}$
Reverse Term Weights
Terms that appear in many different documents are less indicative of overall topic.
  $df_i$ = document frequency of term i = number of documents containing term i
  $idf_i$ = inverse document frequency of term i = $\log_2 (N / df_i)$ (N: total number of documents)
An indication of a term’s discrimination power.
Log used to dampen the effect relative to tf.
TF-IDF Weighting
A typical combined term importance indicator is tf-idf weighting:
  $w_{ij} = tf_{ij} \cdot idf_i = tf_{ij} \log_2 (N / df_i)$
A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
Many other ways of determining term weights have been proposed.
Experimentally, tf-idf has been found to work well.
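A minimal sketch of the tf-idf weighting described above, assuming documents are already tokenized into lists of terms. The function name `tf_idf` and the toy collection are illustrative assumptions, not part of the original slides; tf is normalized by the most frequent term in each document and idf uses log base 2, as in the formulas.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf is normalized by the most frequent term in the document;
    idf is log2(N / df), following the slide formulas.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # df[term] = number of documents that contain the term
    df = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        max_f = max(c.values()) if c else 1
        weights.append({
            term: (f / max_f) * math.log2(n_docs / df[term])
            for term, f in c.items()
        })
    return weights

# Hypothetical toy collection, already tokenized.
docs = [
    ["database", "retrieval", "retrieval"],
    ["database", "architecture"],
    ["database", "information", "retrieval"],
    ["computer", "architecture"],
]
weights = tf_idf(docs)
```

In this toy collection "database" appears in three of the four documents, so it gets a low weight everywhere, while "information" appears in only one and gets the highest idf.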
Inner Product Measure
Similarity between the vectors for document $d_j$ and query $q$ can be computed as the vector inner product:
  $sim(d_j, q) = d_j \cdot q = \sum_{i=1}^{t} w_{ij} \, w_{iq}$
where $w_{ij}$ is the weight of term i in document j and $w_{iq}$ is the weight of term i in the query.
For binary vectors, the inner product is the number of matched query terms in the document (size of intersection).
For weighted term vectors, it is the sum of the products of the weights of the matched terms.
Inner Product -- Examples
Binary:
  Vocabulary (one vector position per term): retrieval, database, architecture, computer, text, management, information
  D = (1, 1, 1, 0, 1, 1, 0)
  Q = (1, 0, 1, 0, 0, 1, 1)
  sim(D, Q) = 3
  Size of vector = size of vocabulary = 7
  0 means the corresponding term is not found in the document or query
Weighted:
  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + 1T3
  Q = 0T1 + 0T2 + 2T3
  sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
  sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2
Problems?
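The two examples above can be checked with a short sketch. The helper name `inner_product` is an assumption for illustration; vectors are plain Python lists with one position per vocabulary term.

```python
def inner_product(d, q):
    """Sum of pairwise products of the term weights of d and q."""
    return sum(wd * wq for wd, wq in zip(d, q))

# Binary example: the result counts the matched query terms.
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
binary_sim = inner_product(D, Q)       # 3

# Weighted example over terms T1, T2, T3.
D1 = [2, 3, 5]
D2 = [3, 7, 1]
Qw = [0, 0, 2]
sim_d1 = inner_product(D1, Qw)         # 10
sim_d2 = inner_product(D2, Qw)         # 2
```

Note that the inner product is unbounded: multiplying every weight in a document by a constant multiplies its score by the same constant, so documents with larger weights are favored regardless of direction.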
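Cosine Similarity Measure
Cosine similarity measures the cosine of the angle between two vectors.
Inner product normalized by the vector lengths:
  $CosSim(d_j, q) = \frac{d_j \cdot q}{|d_j| \, |q|} = \frac{\sum_{i=1}^{t} w_{ij} \, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2 \cdot \sum_{i=1}^{t} w_{iq}^2}}$
[Figure: vectors D1, D2, and Q drawn in the space spanned by terms t1, t2, t3.]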
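As a rough illustration of the cosine measure, the sketch below applies it to the weighted vectors D1, D2, and Q from the inner product examples. The function name `cos_sim` and the choice to return 0.0 for a zero-length vector are assumptions made for this example.

```python
import math

def cos_sim(d, q):
    """Inner product of d and q divided by the product of their lengths."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    # Assumption: define the similarity as 0.0 when either vector is all zeros.
    return dot / norm if norm else 0.0

D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q = [0, 0, 2]

cos_d1 = cos_sim(D1, Q)   # 10 / (sqrt(38) * 2), about 0.81
cos_d2 = cos_sim(D2, Q)   # 2 / (sqrt(59) * 2), about 0.13
```

Unlike the raw inner product, the result does not change when a vector is multiplied by a positive constant, so scaled-up term weights alone do not raise a document's score.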
Relevance Using Hyperlinks
Problem with keyword search? Problem with most-frequently-visited website search?
Idea: use the popularity of a Web site (e.g. how many people visit it) to rank site pages that match given keywords.
Problem: hard to find the actual popularity of a site.
Different Ranking Factors
Keyword and anchor-text based search finds all the related pages first.
PageRank then ranks the search result set.
  A highly ranked page is of no interest to you if it is not related to the query.
Link Counts
[Figure: link counts. Captions: "Linked by 2 Important Pages" and "Linked by 2 Unimportant Pages". Pages shown: Sep’s Home Page, Taher’s Home Page, Yahoo!, CNN, DB Pub Server, CS361.]
Definition of PageRank
  $x_i = \sum_{j \in B_i} \frac{1}{N_j} x_j$
Let us calculate.
Definition of PageRank
[Figure: worked example on the pages Yahoo!, CNN, DB Pub Server, Taher, and Sep. Values shown: link weights 1/2, 1/2, 1, 1; ranks 0.1, 0.1, 0.1, 0.05, 0.25.]
PageRank Diagram
Initialize all nodes to rank $x_i^{(0)} = \frac{1}{n}$
[Figure: three nodes, each with rank 0.333.]
PageRank Diagram
Propagate ranks across links (multiplying by link weights)
[Figure: values on the links: 0.167, 0.167, 0.333, 0.333.]
PageRank Diagram
  $x_i^{(1)} = \sum_{j \in B_i} \frac{1}{N_j} x_j^{(0)}$
[Figure: node ranks after the first iteration: 0.333, 0.5, 0.167.]
PageRank Diagram
[Figure: ranks propagated across the links again; values on the links: 0.167, 0.167, 0.5, 0.167.]
PageRank Diagram
  $x_i^{(2)} = \sum_{j \in B_i} \frac{1}{N_j} x_j^{(1)}$
[Figure: node ranks after the second iteration: 0.5, 0.333, 0.167.]
PageRank Diagram
After a while…
  $x_i = \sum_{j \in B_i} \frac{1}{N_j} x_j$
[Figure: node ranks at convergence: 0.4, 0.4, 0.2.]
Computing PageRank
Initialize:
  $x_i^{(0)} = \frac{1}{n}$
Repeat until convergence:
  $x_i^{(k+1)} = \sum_{j \in B_i} \frac{1}{N_j} x_j^{(k)}$
where
  $x_i$ = importance of page i
  $B_i$ = pages j that link to page i
  $N_j$ = number of outlinks from page j
  $x_j$ = importance of page j
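The iteration above can be sketched in a few lines. The function name `pagerank`, the tolerance, and the three-page graph are assumptions for illustration: the graph (A links to B and C, B links to C, C links to A) was chosen because it reproduces the values in the PageRank diagrams (0.333, 0.167, 0.5 after one step and 0.4, 0.2, 0.4 at convergence), but the actual graph drawn in the slides is not recoverable from the text. The sketch also assumes every page has at least one outlink.

```python
def pagerank(out_links, tol=1e-10, max_iter=1000):
    """Iterate x_i = sum over pages j linking to i of x_j / N_j.

    out_links maps each page to the list of pages it links to.
    Assumes every page has at least one outlink (no dangling pages).
    """
    pages = list(out_links)
    n = len(pages)
    x = {p: 1.0 / n for p in pages}          # initialize: x_i = 1/n
    for _ in range(max_iter):
        new_x = {p: 0.0 for p in pages}
        for j, targets in out_links.items():
            share = x[j] / len(targets)      # x_j / N_j
            for i in targets:
                new_x[i] += share
        if max(abs(new_x[p] - x[p]) for p in pages) < tol:
            return new_x
        x = new_x
    return x

# Hypothetical three-page graph (see the assumptions above).
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
first_step = pagerank(graph, max_iter=1)
ranks = pagerank(graph)
```

Pushing each page's rank along its outlinks, as done here, computes the same sums as pulling from the in-link sets $B_i$, and avoids building those sets explicitly.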
Definition of PageRank
The importance of a page is given by the importance of the pages that link to it.
d is a damping factor, usually 0.85.
  $x_i = (1 - d) + d \sum_{j \in B_i} \frac{1}{N_j} x_j$
where
  $x_i$ = importance of page i
  $B_i$ = pages j that link to page i
  $N_j$ = number of outlinks from page j
  $x_j$ = importance of page j
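A minimal sketch of the damped definition, iterated to a fixed point. The function name `damped_pagerank`, the starting value of 1.0 for every page, and the three-page graph are assumptions for illustration; the slides do not specify an initialization for this variant. As before, every page is assumed to have at least one outlink.

```python
def damped_pagerank(out_links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate x_i = (1 - d) + d * sum over j linking to i of x_j / N_j.

    Assumes every page has at least one outlink (no dangling pages).
    """
    pages = list(out_links)
    x = {p: 1.0 for p in pages}              # assumed starting value
    for _ in range(max_iter):
        new_x = {p: 1.0 - d for p in pages}
        for j, targets in out_links.items():
            share = d * x[j] / len(targets)  # d * x_j / N_j
            for i in targets:
                new_x[i] += share
        if max(abs(new_x[p] - x[p]) for p in pages) < tol:
            return new_x
        x = new_x
    return x

# Hypothetical three-page graph: A links to B and C, B to C, C to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = damped_pagerank(graph)
```

With this form of the formula the ranks sum to the number of pages rather than to 1, since each page receives a constant (1 - d) in addition to its damped share of incoming rank.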
Synonyms and Homonyms
Synonyms
  E.g. document: “motorcycle repair”, query: “motorcycle maintenance”
    Need to realize that “maintenance” and “repair” are synonyms
  System can extend query as “motorcycle and (repair or maintenance)”
Homonyms
  E.g. “object” has different meanings as noun/verb
  Can disambiguate meanings (to some extent) from the context
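The synonym case can be illustrated with a small sketch of query expansion. The names `expand_query` and `matches` and the hand-written synonym table are hypothetical; a real system would take synonyms from a thesaurus.

```python
def expand_query(terms, synonyms):
    """Replace each query term by the set holding the term and its synonyms."""
    return [{term} | set(synonyms.get(term, ())) for term in terms]

def matches(doc_terms, expanded_query):
    """AND across query terms, OR within each synonym group."""
    doc = set(doc_terms)
    return all(group & doc for group in expanded_query)

# Hypothetical synonym table for the example.
synonyms = {"repair": {"maintenance"}, "maintenance": {"repair"}}

# "motorcycle maintenance" becomes motorcycle AND (maintenance OR repair).
query = expand_query(["motorcycle", "maintenance"], synonyms)
found = matches(["motorcycle", "repair"], query)
```

Here the document “motorcycle repair” matches the expanded query even though it does not contain the word “maintenance”.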
Indexing of Documents
An inverted index maps each keyword $K_i$ to a set of documents $S_i$ that contain the keyword.
  Documents are identified by identifiers.
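A minimal sketch of an inverted index, assuming documents are short strings keyed by integer identifiers and that splitting on whitespace is an adequate tokenizer. The names `build_inverted_index` and `search_all` and the sample documents are illustrative assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to the set of identifiers of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_all(index, keywords):
    """Return identifiers of documents that contain every keyword."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()

# Hypothetical documents, keyed by identifier.
docs = {
    1: "information retrieval systems",
    2: "database systems",
    3: "database information retrieval",
}
index = build_inverted_index(docs)
hits = search_all(index, ["information", "retrieval"])
```

A conjunctive keyword query then reduces to intersecting the sets $S_i$ of the query keywords.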
Measuring Retrieval Effectiveness
Relevant performance metrics:
  Precision: what percentage of the retrieved documents are relevant to the query.
  Recall: what percentage of the documents relevant to the query were retrieved.
Precision and Recall
Precision: a/(a+c)
  Among all the retrieved, how many are actually positive?
Recall: a/(a+b)
  Percentage of actual positive data retrieved
F measure: 2pr/(r+p)

                      predicted true   predicted false
  actual positive           a                 b
  actual negative           c                 d
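The three measures can be computed directly from the sets of retrieved and relevant document identifiers. This is a sketch; the function name `precision_recall_f`, the convention of returning 0.0 when a denominator is zero, and the sample sets are assumptions.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F measure) using the a, b, c cells of the table."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)   # relevant and retrieved
    b = len(relevant - retrieved)   # relevant but not retrieved
    c = len(retrieved - relevant)   # retrieved but not relevant
    p = a / (a + c) if a + c else 0.0
    r = a / (a + b) if a + b else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical result: 4 documents retrieved, 3 relevant, 2 in common.
p, r, f = precision_recall_f(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
```

For these sets a = 2, b = 1, and c = 2, so precision is 1/2, recall is 2/3, and the F measure is 4/7.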
Training Data
Problem: which documents are actually relevant, and which are not?
Usual solution: human judges
  Create a corpus of documents and queries, with humans deciding which documents are relevant to which queries
  TREC (Text REtrieval Conference) Benchmark
Web Crawling
Web crawlers are programs that locate and gather information on the Web.
  Recursively follow hyperlinks present in known documents, to find other documents
  Starting from a seed set of documents
Fetched documents
  Handed over to an indexing system
  Can be discarded after indexing, or stored as a cached copy
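The crawling loop can be sketched without any network access by treating the Web as an in-memory link table. Everything named here is hypothetical: the `crawl` function, its `fetch_links` and `index` callbacks, the `max_pages` limit, and the example URLs. A real crawler would also need to fetch and parse pages, respect robots.txt, and rate-limit its requests.

```python
from collections import deque

def crawl(seeds, fetch_links, index, max_pages=100):
    """Breadth-first crawl from the seed set.

    fetch_links(url) returns the hyperlinks found in the document at url;
    index(url) hands the fetched document over to the indexing system.
    """
    frontier = deque(dict.fromkeys(seeds))   # de-duplicated seed set
    seen = set(frontier)
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        links = fetch_links(url)
        index(url)
        crawled.append(url)
        for link in links:
            if link not in seen:             # follow each link only once
                seen.add(link)
                frontier.append(link)
    return crawled

# Hypothetical in-memory "Web": URL -> hyperlinks on that page.
toy_web = {
    "a.example/home": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/home"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}
indexed = []
order = crawl(["a.example/home"], lambda u: toy_web.get(u, []), indexed.append)
```

The `seen` set is what keeps the crawler from looping on pages that link back to each other, such as the home and about pages in the toy table.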
Browsing
Storing related documents together in a library facilitates browsing.
  Users can see not only the requested document but also related ones.
Browsing is facilitated by a classification system that organizes logically related documents together.
Organization is hierarchical: classification hierarchy
A Classification Hierarchy For A Library System
Classification DAG
Documents can reside in multiple places in a hierarchy in an information retrieval system, since physical location is not important.
The classification hierarchy is thus a Directed Acyclic Graph (DAG).
A Classification DAG For A Library Information Retrieval System
Web Directories
A Web directory is just a classification directory on Web pages
  E.g. Yahoo! Directory, Open Directory project
Issues:
  What should the directory hierarchy be?
  Given a document, which nodes of the directory are categories relevant to the document?
    Often done manually
    Classification of documents into a hierarchy may be done based on term similarity
Some slides of this slide set are adapted from the following slides:
  Prof. James Allan’s course slides
  Extrapolation Methods for Accelerating PageRank Computations by Sepandar D. Kamvar et al.