Basic Implementation and Evaluations
Aj. Khuanlux Mitsophonsiri CS.426 INFORMATION RETRIEVAL
Simple Tokenizing
Analyze text into a sequence of discrete tokens (words)
Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token
However, frequently they are not. The simplest approach is to ignore all numbers and punctuation and to use only case-insensitive, unbroken strings of alphabetic characters as tokens
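As a sketch, the simple tokenizer described above might look like this in Python (the function name and sample text are illustrative, not taken from the course's VSR system):

```python
import re

def tokenize(text):
    # Lowercase the text and keep only unbroken runs of alphabetic
    # characters; numbers and punctuation are discarded entirely.
    return re.findall(r"[a-z]+", text.lower())

# Note the simplification: "e-mail" splits into two tokens, and
# "Republican"/"republican" collapse to one case-insensitive form.
print(tokenize("The e-mail arrived in 1999, Republican vs. republican!"))
```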
Tokenizing HTML
Should text in HTML commands not typically seen by the user be included as tokens?
Words appearing in URLs
Words appearing in “meta text” of images
The simplest approach is to exclude all HTML tag information (between “<” and “>”) from tokenization
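A minimal version of that tag-stripping approach could be regex-based, assuming well-formed tags (real-world HTML calls for a proper parser):

```python
import re

def strip_tags(html):
    # Drop everything between "<" and ">" before tokenization,
    # replacing each tag with a space so adjacent words don't merge.
    return re.sub(r"<[^>]*>", " ", html)

print(strip_tags("<p>Hello <b>world</b></p>"))
```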
Stopwords
It is typical to exclude high-frequency words (e.g. function words: “a”, “the”, “in”, “to”; pronouns: “I”, “he”, “she”, “it”)
Stopwords are language dependent. VSR uses a standard set of about 500 for English
For efficiency, store strings for stopwords in a hashtable to recognize them in constant time
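A sketch of that constant-time lookup, using a Python set as the hashtable (the miniature stopword list here is illustrative; VSR's real list has roughly 500 entries):

```python
# Illustrative miniature stopword list, not VSR's actual ~500-word set.
STOPWORDS = {"a", "the", "in", "to", "i", "he", "she", "it"}

def remove_stopwords(tokens):
    # Set membership is a constant-time hash lookup, so filtering
    # is linear in the number of tokens.
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "cat", "in", "a", "hat"]))
```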
Stemming
Reduce tokens to the “root” form of words to recognize morphological variation: “computer”, “computational”, and “computation” are all reduced to the same token “compute”
Correct morphological analysis is language specific and can be complex
Stemming “blindly” strips off known affixes (prefixes and suffixes) in an iterative fashion
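The blind, iterative affix stripping can be sketched with a toy stemmer (the suffix list and minimum-stem-length rule are invented for illustration; this is not the Porter algorithm):

```python
# Invented suffix list for illustration only.
SUFFIXES = ["ational", "ation", "er", "ing", "s"]

def crude_stem(word):
    # Iteratively strip any known suffix until none applies,
    # keeping at least a 3-letter stem.
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                stripped = True
                break
    return word

# "computer", "computation", and "computational" all collapse to "comput".
print(crude_stem("computational"))
```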
Porter Stemmer
Simple procedure for removing known affixes in English without using a dictionary
Can produce unusual stems that are not English words: “computer”, “computational”, and “computation” are all reduced to the same token “comput”
May conflate (reduce to the same token) words that are actually distinct
May not recognize all morphological derivations
Porter Stemmer Errors
Errors of “commission”: organization, organ → organ; police, policy → polic; arm, army → arm
Errors of “omission”: cylinder, cylindrical; create, creation; Europe, European
Evaluation
Why System Evaluation?
There are many retrieval models, algorithms, and systems; which one is the best?
What is the best component for:
Ranking function (cosine, …)
Term selection (stopword removal, stemming, …)
Term weighting (TF, TF-IDF, …)
How far down the ranked list will a user need to look to find some/all relevant documents?
Difficulties in Evaluating IR Systems
Effectiveness is related to the relevancy of retrieved items
Even if relevancy is binary, it can be a difficult judgment to make
Relevancy, from a human standpoint, is:
Subjective: depends upon a specific user’s judgment
Situational: relates to the user’s current needs
Cognitive: depends on human perception and behavior
Dynamic: changes over time
Human Labeled Corpora
Start with a corpus of documents
Collect a set of queries for this corpus
Have one or more human experts exhaustively label the relevant documents for each query
Typically assumes binary relevance judgments
Requires considerable human effort for large document/query corpora
Precision and Recall
Precision: the ability to retrieve top-ranked documents that are mostly relevant
Recall: the ability of the search to find all of the relevant items in the corpus
Precision and Recall
recall = (number of relevant documents retrieved) / (total number of relevant documents)
precision = (number of relevant documents retrieved) / (total number of documents retrieved)

[Figure: Venn diagram of the entire document collection. The retrieved set and the relevant set overlap in “retrieved & relevant”; the remaining regions are “relevant but not retrieved”, “retrieved & irrelevant”, and “not retrieved & irrelevant”.]
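Given sets of document ids, the two formulas translate directly into code (a minimal sketch, not VSR code):

```python
def precision_recall(retrieved, relevant):
    # Number of relevant documents retrieved = the intersection.
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of 4 retrieved docs are relevant; 2 of 5 relevant docs were found.
print(precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7}))
```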
Determining Recall is Difficult
Total number of relevant items is sometimes not available:
Sample across the database and perform relevance judgments on these items
Apply different retrieval algorithms to the same database for the same query; the aggregate of relevant items is taken as the total relevant set
Trade-off between Recall and Precision
[Figure: precision (y-axis) versus recall (x-axis) trade-off curve. A system returning most relevant documents but including lots of junk has high recall and low precision; one returning relevant documents but missing many useful ones has high precision and low recall; the ideal is both near 1.]
Computing Recall/Precision Points
For a given query, produce the ranked list of retrievals
Adjusting a threshold on this ranked list produces different sets of retrieved documents, and therefore different recall/precision measures
Mark each document in the ranked list that is relevant
Compute a recall/precision pair for each position in the ranked list that contains a relevant document
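The procedure above can be sketched as follows, where `relevance[i]` marks whether the document at rank i+1 was judged relevant (the names are illustrative):

```python
def recall_precision_points(relevance, total_relevant):
    # Walk down the ranked list; at each rank holding a relevant
    # document, record (recall so far, precision so far).
    points, hits = [], 0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

# Relevant documents at ranks 1, 3, 6, 9, 10 out of 5 total relevant.
print(recall_precision_points([1, 0, 1, 0, 0, 1, 0, 0, 1, 1], 5))
```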
Common Representation
              Relevant   Not Relevant
Retrieved         A           B
Not Retrieved     C           D

Relevant = A+C
Retrieved = A+B
Collection size = A+B+C+D
Precision = A/(A+B)
Recall = A/(A+C)
Miss = C/(A+C)
False alarm = B/(B+D)
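The four measures fall straight out of the contingency counts; a direct transcription of the formulas, with illustrative parameter names:

```python
def contingency_measures(a, b, c, d):
    # a = retrieved & relevant      b = retrieved & not relevant
    # c = not retrieved & relevant  d = not retrieved & not relevant
    return {
        "precision":   a / (a + b),
        "recall":      a / (a + c),
        "miss":        c / (a + c),
        "false_alarm": b / (b + d),
    }

print(contingency_measures(4, 6, 1, 14))
```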
Precision and Recall Example
Ranking 1 (relevant documents at ranks 1, 3, 6, 9, 10):
Recall    0.2 0.2 0.4 0.4 0.4 0.6 0.6 0.6 0.8 1.0
Precision 1.0 0.5 0.67 0.5 0.4 0.5 0.43 0.38 0.44 0.5

Ranking 2 (relevant documents at ranks 2, 5, 6, 7, 8):
Recall    0.0 0.2 0.2 0.2 0.4 0.6 0.8 1.0 1.0 1.0
Precision 0.0 0.5 0.33 0.25 0.4 0.5 0.57 0.63 0.55 0.5
Average Precision of a Query
Often we want a single-number effectiveness measure, e.g. for a machine learning algorithm to detect improvement
Average precision is widely used in IR: average the precision values at the points where recall increases
Ranking 1:
Recall    0.2 0.2 0.4 0.4 0.4 0.6 0.6 0.6 0.8 1.0
Precision 1.0 0.5 0.67 0.5 0.4 0.5 0.43 0.38 0.44 0.5
Average precision = (1.0 + 0.67 + 0.5 + 0.44 + 0.5) / 5 = 62.2%

Ranking 2:
Recall    0.0 0.2 0.2 0.2 0.4 0.6 0.8 1.0 1.0 1.0
Precision 0.0 0.5 0.33 0.25 0.4 0.5 0.57 0.63 0.55 0.5
Average precision = (0.5 + 0.4 + 0.5 + 0.57 + 0.63) / 5 = 52.0%
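Averaging the precision at each rank where recall increases (i.e. at each relevant document's rank) can be sketched as below; the 1/0 relevance lists are reconstructed from the tables above, and the unrounded precisions give about 62.2% and 51.9%:

```python
def average_precision(relevance):
    # Mean of the precision values at the ranks where recall
    # increases, i.e. at each relevant document's rank.
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

ranking1 = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]  # relevant at ranks 1, 3, 6, 9, 10
ranking2 = [0, 1, 0, 0, 1, 1, 1, 1, 0, 0]  # relevant at ranks 2, 5, 6, 7, 8
print(average_precision(ranking1), average_precision(ranking2))
```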
Average Recall/Precision Curve
Typically average performance over a large set of queries
Compute average precision at each standard recall level across all queries
Plot average precision/recall curves to evaluate overall system performance on a document/query corpus
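One common concrete choice for the standard recall levels is TREC-style 11-point interpolation, where the interpolated precision at recall level r is the maximum precision observed at any recall ≥ r. A sketch (function and argument names are illustrative):

```python
def interpolated_precision(points, levels=None):
    # points: list of (recall, precision) pairs for one query.
    # Interpolated precision at level r = max precision at recall >= r.
    if levels is None:
        levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    result = []
    for r in levels:
        candidates = [p for rec, p in points if rec >= r]
        result.append(max(candidates) if candidates else 0.0)
    return result

# Recall/precision points from one ranked list.
print(interpolated_precision(
    [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.5)]))
```

Averaging these 11 values across all queries yields one curve per system, which is what the comparison plot on the next slide shows.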
Compare Two or More Systems
The curve closest to the upper right-hand corner of the graph indicates the best performance
[Figure: average precision/recall curves for two systems, “Stem” and “NoStem”, with recall (0.1–1.0) on the x-axis and precision (0–1) on the y-axis.]
Fallout Rate
Problems with both precision and recall:
The number of irrelevant documents in the collection is not taken into account
Recall is undefined when there is no relevant document in the collection
Precision is undefined when no document is retrieved
fallout = (number of nonrelevant items retrieved) / (total number of nonrelevant items in the collection)
Subjective Relevance Measure
Novelty ratio: the proportion of items retrieved and judged relevant by the user of which they were previously unaware. Measures the ability to find new information on a topic.
Coverage ratio: the proportion of relevant items retrieved out of the total relevant documents known to the user prior to the search. Useful when the user wants to locate documents they have seen before (e.g., the budget report for Year 2000).
Other Factors to Consider
User effort: Work required from the user in formulating queries, conducting the search, and screening the output
Response time: Time interval between receipt of a user query and the presentation of system responses
Form of presentation: Influence of search output format on the user’s ability to utilize the retrieved materials
Collection coverage: Extent to which any/all relevant items are included in the document corpus