Upload
shelby-conley
View
16
Download
1
Embed Size (px)
DESCRIPTION
CS 430: Information Discovery. Lecture 4 Ranking. Course Administration. •The slides for Lecture 3 have been reposted with slightly revised notation. •The reading for Discussion Class 2 requires a computer connected to the network with a Cornell IP address. - PowerPoint PPT Presentation
Citation preview
1
CS 430: Information Discovery
Lecture 4
Ranking
2
Course Administration
• The slides for Lecture 3 have been reposted with slightly revised notation.
• The reading for Discussion Class 2 requires a computer connected to the network with a Cornell IP address.
• Teaching assistants do not have office hours. If your query cannot be addressed by email, ask to meet with them or come to my office hours.
• Assignment 1 is an individual assignment. Discuss the concepts and the choice of methods with your colleagues, but the actual programs and report much be individual work.
3
Choice of Weights
ant bee cat dog eel fox gnu hog
q ? ? d1 ? ? d2 ? ? ? ? d3 ? ? ? ? ?
queryq ant dogdocument text termsd1 ant ant bee ant bee
d2 dog bee dog hog dog ant dog ant bee dog hog
d3 cat gnu dog eel fox cat dog eel fox gnu
What weights lead to the best information retrieval?
4
Methods for Selecting Weights
Empirical
Test a large number of possible weighting schemes with actual data. (This lecture, based on work of Salton, et al.)
Model based
Develop a mathematical model of word distribution and derive weighting scheme theoretically. (Probabilistic model of information retrieval.)
5
Weighting1. Term Frequency
Suppose term j appears fij times in document i. What weighting should be given to a term j?
Term Frequency: Concept
A term that appears many times within a document is likely to be more important than a term that appears only once.
6
Term Frequency: Free-text Document
Length of document
Simple method (as illustrated in Lecture 3) is to use fij as the term frequency.
...but, in free-text documents, terms are likely to appear more often in long documents. Therefore fij should be scaled by some variable related to document length.
i
7
Term Frequency: Free-text Document
Standard method for free-text documents
Scale fij relative to the frequency of other terms in the document. This partially corrects for variations in the length of the documents.
Let mi = max (fij) i.e., mi is the maximum frequency of any term in document i
Term frequency (tf):
tfij = fij / mi when fij > 0
Note: There is no special justification for taking this form of term frequency except that it works well in practice and is easy to calculate.
i
8
Weighting2. Inverse Document Frequency
Suppose term j appears fij times in document i. What weighting should be given to a term j?
Inverse Document Frequency: Concept
A term that occurs in a few documents is likely to be a better discriminator that a term that appears in most or all documents.
9
Inverse Document Frequency
Suppose there are n documents and that the number of documents in which term j occurs is nj.
A possible method might be to use n/nj as the inverse document frequency.
Standard method
The simple method over-emphasizes small differences. Therefore use a logarithm.
Inverse document frequency (idf):
idfj = log2 (n/nj) + 1 nj > 0
Note: There is no special justification for taking this form of inverse document frequency except that it works well in practice and is easy to calculate.
10
Example of Inverse Document Frequency
Example
n = 1,000 documents
term j nj idfj
A 100 4.32 B 500 2.00 C 900 1.13 D 1,000 1.00
From: Salton and McGill
11
Full Weighting: Standard Form of tf.idf
Practical experience has demonstrated that weights of the following form perform well in a wide variety of circumstances:
(weight of term j in document i)
= (term frequency) * (inverse document frequency)
The standard tf.idf weighting scheme, for free text documents, is:
tij = tfij * idfj
= (fij / mi) * (log2 (n/nj) + 1) when nj > 0
12
Structured Text
Structured text
Structured texts, e.g., queries or catalog records, have different distribution of terms from free-text. A modified expression for the term frequency is:
tfij = K + (1 - K)*fij / mi when fij > 0
K is a parameter between 0 and 1 that can be tuned for a specific collection.
Query
To weigh terms in the query, Salton and Buckley recommend K equal to 0.5.
i
13
Similarity
The similarity between query q and document i is given by:
tqktik
|dq| |di|
Where dq and di are the corresponding weighted term vectors, with components in the k dimension (corresponding to term k) given by:
tqk = (0.5 + 0.5*fqk / mq)*(log2 (n/nk) + 1) when fqk > 0
tik = (fik / mi) * (log2 (n/nk) + 1) when fik > 0
k=1
n
cos(dq, di) =
14
Boolean Queries
Boolean query: two or more search terms, related by logical operators, e.g.,
and or not
Examples:
abacus and actor
abacus or actor
(abacus and actor) or (abacus and atoll)
not actor
15
Boolean Diagram
A B
A and B
A or B
not (A or B)
16
Adjacent and Near Operators
abacus adj actor
Terms abacus and actor are adjacent to each other as in the string
"abacus actor"
abacus near 4 actor
Terms abacus and actor are near to each other as in the string
"the actor has an abacus"
Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).
17
Evaluation of Boolean Operators
Precedence of operators must be defined:
adj, near high
and, not
or low
Example
A and B or C and B
is evaluated as
(A and B) or (C and B)
18
Inverted File
Inverted file:
A list of search terms that are used to index a set of documents. The inverted file is organized for associative look-up, i.e., to answer the question, "In which documents does a specified search term appear?"
In practical applications, the inverted file contains related information, such as the location within the document where the search terms appear.
19
Inverted File -- Basic Concept
Word Document
abacus 3
19 22
actor 2 19 29
aspen 5 atoll 11
34
Stop words are removed and stemming carried out before building the index.
20
Inverted List -- Concept
Inverted List: All the entries in an inverted file that apply to a specific word, e.g.
abacus 3 19 22
Posting: Entry in an inverted list, e.g., there are three postings for "abacus".
21
Evaluating a Boolean Query
3 19 22 2 19 29
To evaluate the and operator, merge the two inverted lists
with a logical AND operation.
Examples: abacus and actor
Postings for abacus
Postings for actor
Document 19 is the only document that contains both terms, "abacus" and "actor".
22
Enhancements to Inverted Files -- Concept
Location: The inverted file holds information about the location of each term within the document.
Uses
adjacency and near operatorsuser interface design -- highlight location of search term
Frequency: The inverted file includes the number of postings for each term.
Uses
term weightingquery processing optimization
23
Inverted File -- Concept (Enhanced)
Word Postings Document Location
abacus 4 3 94 19 7 19 212
22 56actor 3 2 66
19 213 29 45
aspen 1 5 43atoll 3 11 3
11 70 34 40
24
Evaluating an Adjacency Operation
Examples: abacus adj actor
Postings for abacus
Postings for actor
Document 19, locations 212 and 213, is the only occurrence of the terms "abacus" and "actor" adjacent.
3 94 19 719 212 22 56
2 66 19 213 29 45