Basic Implementation and Evaluations
Aj. Khuanlux Mitsophonsiri CS.426 INFORMATION RETRIEVAL
Simple Tokenizing
Analyze text into a sequence of discrete tokens (words)
Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token
However, frequently they are not. The simplest approach is to ignore all numbers and punctuation and to use only case-insensitive, unbroken strings of alphabetic characters as tokens
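As a sketch, the simple tokenizer described above might look like this in Python (the function name and sample text are illustrative, not taken from the course's VSR system):

```python
import re

def tokenize(text):
    # Lowercase the text and keep only unbroken runs of alphabetic
    # characters; numbers and punctuation are discarded entirely.
    return re.findall(r"[a-z]+", text.lower())

# Note the simplification: "e-mail" splits into two tokens, and
# "Republican"/"republican" collapse to one case-insensitive form.
print(tokenize("The e-mail arrived in 1999, Republican vs. republican!"))
```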
Tokenizing HTML
Should text in HTML commands not typically seen by the user be included as tokens?
Words appearing in URLs
Words appearing in “meta text” of images
The simplest approach is to exclude all HTML tag information (between “<” and “>”) from tokenization
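A minimal version of that tag-stripping approach could be regex-based, assuming well-formed tags (real-world HTML calls for a proper parser):

```python
import re

def strip_tags(html):
    # Drop everything between "<" and ">" before tokenization,
    # replacing each tag with a space so adjacent words don't merge.
    return re.sub(r"<[^>]*>", " ", html)

print(strip_tags("<p>Hello <b>world</b></p>"))
```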
Stopwords
It is typical to exclude high-frequency words (e.g. function words: “a”, “the”, “in”, “to”; pronouns: “I”, “he”, “she”, “it”)
Stopwords are language dependent. VSR uses a standard set of about 500 for English
For efficiency, store strings for stopwords in a hashtable to recognize them in constant time
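A sketch of that constant-time lookup, using a Python set as the hashtable (the miniature stopword list here is illustrative; VSR's real list has roughly 500 entries):

```python
# Illustrative miniature stopword list, not VSR's actual ~500-word set.
STOPWORDS = {"a", "the", "in", "to", "i", "he", "she", "it"}

def remove_stopwords(tokens):
    # Set membership is a constant-time hash lookup, so filtering
    # is linear in the number of tokens.
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "cat", "in", "a", "hat"]))
```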
Stemming
Reduce tokens to the “root” form of words to recognize morphological variation: “computer”, “computational”, and “computation” are all reduced to the same token “compute”
Correct morphological analysis is language specific and can be complex
Stemming “blindly” strips off known affixes (prefixes and suffixes) in an iterative fashion
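The blind, iterative affix stripping can be sketched with a toy stemmer (the suffix list and minimum-stem-length rule are invented for illustration; this is not the Porter algorithm):

```python
# Invented suffix list for illustration only.
SUFFIXES = ["ational", "ation", "er", "ing", "s"]

def crude_stem(word):
    # Iteratively strip any known suffix until none applies,
    # keeping at least a 3-letter stem.
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                stripped = True
                break
    return word

# "computer", "computation", and "computational" all collapse to "comput".
print(crude_stem("computational"))
```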
Porter Stemmer
Simple procedure for removing known affixes in English without using a dictionary
Can produce unusual stems that are not English words: “computer”, “computational”, and “computation” are all reduced to the same token “comput”
May conflate (reduce to the same token) words that are actually distinct
May not recognize all morphological derivations
Porter Stemmer Errors
Errors of “commission”: organization, organ → organ; police, policy → polic; arm, army → arm
Errors of “omission”: cylinder, cylindrical; create, creation; Europe, European
Evaluation
Why System Evaluation?
There are many retrieval models, algorithms, and systems; which one is the best?
What is the best component for:
Ranking function (cosine, …)
Term selection (stopword removal, stemming, …)
Term weighting (TF, TF-IDF, …)
How far down the ranked list will a user need to look to find some/all relevant documents?
Difficulties in Evaluating IR Systems
Effectiveness is related to the relevancy of retrieved items
Even if relevancy is binary, it can be a difficult judgment to make
Relevancy, from a human standpoint, is:
Subjective: depends upon a specific user’s judgment
Situational: relates to the user’s current needs
Cognitive: depends on human perception and behavior
Dynamic: changes over time
Human Labeled Corpora
Start with a corpus of documents
Collect a set of queries for this corpus
Have one or more human experts exhaustively label the relevant documents for each query
Typically assumes binary relevance judgments
Requires considerable human effort for large document/query corpora
Precision and Recall
Precision: the ability to retrieve top-ranked documents that are mostly relevant
Recall: the ability of the search to find all of the relevant items in the corpus
Precision and Recall
recall = (number of relevant documents retrieved) / (total number of relevant documents)
precision = (number of relevant documents retrieved) / (total number of documents retrieved)

[Figure: Venn diagram of the entire document collection. The retrieved set and the relevant set overlap in “retrieved & relevant”; the remaining regions are “relevant but not retrieved”, “retrieved & irrelevant”, and “not retrieved & irrelevant”.]
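Given sets of document ids, the two formulas translate directly into code (a minimal sketch, not VSR code):

```python
def precision_recall(retrieved, relevant):
    # Number of relevant documents retrieved = the intersection.
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of 4 retrieved docs are relevant; 2 of 5 relevant docs were found.
print(precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7}))
```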
Determining Recall is Difficult
Total number of relevant items is sometimes not available:
Sample across the database and perform relevance judgments on these items
Apply different retrieval algorithms to the same database for the same query; the aggregate of relevant items is taken as the total relevant set
Trade-off between Recall and Precision
[Figure: precision (y-axis) versus recall (x-axis) trade-off curve. A system returning most relevant documents but including lots of junk has high recall and low precision; one returning relevant documents but missing many useful ones has high precision and low recall; the ideal is both near 1.]
Computing Recall/Precision Points
For a given query, produce the ranked list of retrievals
Adjusting a threshold on this ranked list produces different sets of retrieved documents, and therefore different recall/precision measures
Mark each document in the ranked list that is relevant
Compute a recall/precision pair for each position in the ranked list that contains a relevant document
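The procedure above can be sketched as follows, where `relevance[i]` marks whether the document at rank i+1 was judged relevant (the names are illustrative):

```python
def recall_precision_points(relevance, total_relevant):
    # Walk down the ranked list; at each rank holding a relevant
    # document, record (recall so far, precision so far).
    points, hits = [], 0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

# Relevant documents at ranks 1, 3, 6, 9, 10 out of 5 total relevant.
print(recall_precision_points([1, 0, 1, 0, 0, 1, 0, 0, 1, 1], 5))
```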
Common Representation
              Relevant   Not Relevant
Retrieved         A           B
Not Retrieved     C           D

Relevant = A+C
Retrieved = A+B
Collection size = A+B+C+D
Precision = A/(A+B)
Recall = A/(A+C)
Miss = C/(A+C)
False alarm = B/(B+D)
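The four measures fall straight out of the contingency counts; a direct transcription of the formulas, with illustrative parameter names:

```python
def contingency_measures(a, b, c, d):
    # a = retrieved & relevant      b = retrieved & not relevant
    # c = not retrieved & relevant  d = not retrieved & not relevant
    return {
        "precision":   a / (a + b),
        "recall":      a / (a + c),
        "miss":        c / (a + c),
        "false_alarm": b / (b + d),
    }

print(contingency_measures(4, 6, 1, 14))
```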
Precision and Recall Example
Ranking 1 (relevant documents at ranks 1, 3, 6, 9, 10):
Recall    0.2 0.2 0.4 0.4 0.4 0.6 0.6 0.6 0.8 1.0
Precision 1.0 0.5 0.67 0.5 0.4 0.5 0.43 0.38 0.44 0.5

Ranking 2 (relevant documents at ranks 2, 5, 6, 7, 8):
Recall    0.0 0.2 0.2 0.2 0.4 0.6 0.8 1.0 1.0 1.0
Precision 0.0 0.5 0.33 0.25 0.4 0.5 0.57 0.63 0.55 0.5
Average Precision of a Query
Often we want a single-number effectiveness measure, e.g. for a machine learning algorithm to detect improvement
Average precision is widely used in IR: average the precision values at the points where recall increases
Ranking 1:
Recall    0.2 0.2 0.4 0.4 0.4 0.6 0.6 0.6 0.8 1.0
Precision 1.0 0.5 0.67 0.5 0.4 0.5 0.43 0.38 0.44 0.5
Average precision = (1.0 + 0.67 + 0.5 + 0.44 + 0.5) / 5 = 62.2%

Ranking 2:
Recall    0.0 0.2 0.2 0.2 0.4 0.6 0.8 1.0 1.0 1.0
Precision 0.0 0.5 0.33 0.25 0.4 0.5 0.57 0.63 0.55 0.5
Average precision = (0.5 + 0.4 + 0.5 + 0.57 + 0.63) / 5 = 52.0%
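Averaging the precision at each rank where recall increases (i.e. at each relevant document's rank) can be sketched as below; the 1/0 relevance lists are reconstructed from the tables above, and the unrounded precisions give about 62.2% and 51.9%:

```python
def average_precision(relevance):
    # Mean of the precision values at the ranks where recall
    # increases, i.e. at each relevant document's rank.
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

ranking1 = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]  # relevant at ranks 1, 3, 6, 9, 10
ranking2 = [0, 1, 0, 0, 1, 1, 1, 1, 0, 0]  # relevant at ranks 2, 5, 6, 7, 8
print(average_precision(ranking1), average_precision(ranking2))
```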
Average Recall/Precision Curve
Typically average performance over a large set of queries
Compute average precision at each standard recall level across all queries
Plot average precision/recall curves to evaluate overall system performance on a document/query corpus
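One common concrete choice for the standard recall levels is TREC-style 11-point interpolation, where the interpolated precision at recall level r is the maximum precision observed at any recall ≥ r. A sketch (function and argument names are illustrative):

```python
def interpolated_precision(points, levels=None):
    # points: list of (recall, precision) pairs for one query.
    # Interpolated precision at level r = max precision at recall >= r.
    if levels is None:
        levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    result = []
    for r in levels:
        candidates = [p for rec, p in points if rec >= r]
        result.append(max(candidates) if candidates else 0.0)
    return result

# Recall/precision points from one ranked list.
print(interpolated_precision(
    [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.5)]))
```

Averaging these 11 values across all queries yields one curve per system, which is what the comparison plot on the next slide shows.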
Compare Two or More Systems
The curve closest to the upper right-hand corner of the graph indicates the best performance
[Figure: average precision/recall curves for two systems, “Stem” and “NoStem”, with recall (0.1–1.0) on the x-axis and precision (0–1) on the y-axis.]
Fallout Rate
Problems with both precision and recall:
The number of irrelevant documents in the collection is not taken into account
Recall is undefined when there is no relevant document in the collection
Precision is undefined when no document is retrieved
fallout = (number of nonrelevant items retrieved) / (total number of nonrelevant items in the collection)
Subjective Relevance Measure
Novelty ratio: the proportion of items retrieved and judged relevant by the user of which they were previously unaware. Measures the ability to find new information on a topic.
Coverage ratio: the proportion of relevant items retrieved out of the total relevant documents known to the user prior to the search. Useful when the user wants to locate documents they have seen before (e.g., the budget report for Year 2000).
Other Factors to Consider
User effort: Work required from the user in formulating queries, conducting the search, and screening the output
Response time: Time interval between receipt of a user query and the presentation of system responses
Form of presentation: Influence of search output format on the user’s ability to utilize the retrieved materials
Collection coverage: Extent to which any/all relevant items are included in the document corpus