
Special Topics in Information Retrieval

Manuel Montes, Aurelio López

http://ccc.inaoep.mx/~mmontesg/

[email protected], [email protected]

INAOE, January-May 2020


Introduction


Content of the section

• Definition of the task

• The vector space model

• Language models for IR

• Performance evaluation

• Main problems and basic solutions

– Query expansion

– Relevance feedback

– Clustering (documents or results)


Initial questions

• What is an information retrieval system?

• What is its goal?

• What is inside it? Any sub-processes?

• How to evaluate its performance?

• Why are results not always relevant?


First definition

“Information retrieval (IR) embraces the intellectual aspects of the description of information and its specification for search, and also whatever systems, techniques, or machines are employed to carry out the operation”

Calvin Mooers, 1951


General scheme of the IR process


[Diagram: driven by a task, the user conceives an information need, formulates it as a query, searches the corpus, obtains results, and refines the query.]


More definitions

• IR deals with the representation, storage, organization of, and access to information items.

R. Baeza-Yates and B. Ribeiro-Neto, 1999

• The task of an IR system is to retrieve documents or texts with information content that is relevant to a user’s information need.

Sparck Jones & Willett, 1997


Typical IR system


[Diagram: indexing — the document collection is preprocessed and stored in an index; retrieval — the query is processed, matched against the index according to the IR model, and the results are returned.]


Vector space model

• Documents are represented as vectors in an N-dimensional space

– N is the number of terms in the collection

– A term is not necessarily the same as a word

• The query is treated as any other document

• Relevance – measured by similarity:

– A document is relevant to the query if its vector is similar to the query’s vector.


Preprocessing

• Eliminate information about style, such as HTML or XML tags.

– For some applications this information may be useful; for instance, to index only some document sections.

• Remove stop words

– Functional words such as articles, prepositions, and conjunctions are not useful (they do not have a meaning of their own).

• Perform stemming or lemmatization

– The goal is to reduce inflectional forms, and sometimes derivationally related forms, to a common base form.


am, are, is → be        car, cars, car’s → car
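As a rough illustration of these three steps, the following minimal Python sketch strips tags, removes stop words, and applies a toy suffix-stripping stemmer. The stopword list and suffix rules are illustrative only; a real system would use a full stopword list and a proper stemmer such as Porter's.

import re

# Toy stopword list and suffix rules for illustration only.
STOPWORDS = {"a", "an", "and", "are", "as", "at", "is", "of", "on", "the", "to"}
SUFFIXES = ("ing", "ed", "es", "s")

def preprocess(text):
    # 1. Eliminate style information such as HTML/XML tags.
    text = re.sub(r"<[^>]+>", " ", text.lower())
    # 2. Tokenize and remove stop words.
    tokens = [t for t in re.findall(r"[a-z]+", text) if t not in STOPWORDS]
    # 3. Crude stemming: strip a few common inflectional suffixes.
    def stem(t):
        for suf in SUFFIXES:
            if t.endswith(suf) and len(t) - len(suf) >= 3:
                return t[: -len(suf)]
        return t
    return [stem(t) for t in tokens]

print(preprocess("<p>The cars are parked on the road</p>"))
# ['car', 'park', 'road']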


Representation

     t1   t2   …   tn
d1
d2
:         wi,j
dm


All documents (one vector per document)

Weight indicating the contribution of term j in document i.

Whole vocabulary of the collection (all different terms)


Term weighting - two main ideas

• The importance of a term increases proportionally to the number of times it appears in the document.

– It helps to describe document’s content.

• The general importance of a term decreases proportionally to its occurrences in the entire collection.

– Common terms are not useful for distinguishing relevant from non-relevant documents.


Term weighting – main approaches

• Binary weights:

– wi,j = 1 if document di contains term tj, otherwise 0.

• Term frequency (tf):

– wi,j = (no. of occurrences of tj in di)

• tf × idf weighting scheme:

– wi,j = tf(tj, di) × idf(tj), where:

• tf(tj, di) indicates the occurrences of tj in document di

• idf(tj) = log [N/df(tj)], where N is the number of documents in the collection and df(tj) is the number of documents that contain the term tj.
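A minimal Python sketch of this tf × idf scheme over a toy collection of already-preprocessed documents (the example documents are made up for illustration):

import math
from collections import Counter

docs = [["car", "insurance", "car"], ["car", "repair"], ["house", "insurance"]]

N = len(docs)                                   # number of documents in the collection
df = Counter(t for d in docs for t in set(d))   # document frequency df(tj)

def tfidf(doc):
    tf = Counter(doc)                           # tf(tj, di): raw occurrences
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

for i, d in enumerate(docs):
    print(i, tfidf(d))
# "car" appears in 2 of the 3 documents, so its idf is log(3/2);
# "house" appears in only 1, so it gets the higher idf log(3).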


Similarity measure

• Relevance is measured as the similarity between the document vectors and the query vector.

• It is computed by means of the cosine measure.

– The closer the vectors (the smaller their angle), the greater the similarity between document and query.


sim(q, d) = (q · d) / (|q| |d|) = Σi (wq,i × wd,i) / ( sqrt(Σi wq,i²) × sqrt(Σi wd,i²) )

[Figure: query vector q and document vectors d1 and d2 in term space; the angles a1 and a2 between q and each document vector determine the similarity.]
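A minimal Python sketch of the cosine measure over sparse {term: weight} vectors (the example weights are made up for illustration):

import math

def cosine(q, d):
    # Dot product over the terms shared by query and document.
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

q = {"car": 1.0, "insurance": 1.0}
d1 = {"car": 0.8, "insurance": 0.4, "repair": 0.2}
d2 = {"house": 0.9, "garden": 0.5}
print(cosine(q, d1), cosine(q, d2))   # d1 is much closer to the query than d2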


Vector space model − Pros & Cons

• Pros

– Easy to explain

– Mathematically sound

– Approximate query matching

• Cons

– Need term weighting

– Hard to model structured queries

– Normalization increases computational costs

• Most commonly used IR model; it is considered superior to others due to its simplicity and elegance.


Other IR models

• Boolean model (~1950)

• Probabilistic indexing (~1960)

• Vector space model (~1970)

• Probabilistic retrieval (~1976)

• Fuzzy set models (~1980)

• Inference networks (~1992)

• Language models (~1998)

– A language model is built for each document, and the likelihood that the document will generate the query is computed.


Language models

• A statistical language model assigns a probability to a sequence of m words by means of a probability distribution.

• The simplest form of LM simply throws away all conditioning context. It estimates each term independently.

– It is essentially a word distribution and it is called a unigram language model.
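A minimal Python sketch of a unigram language model estimated from a single document by maximum likelihood (relative term frequencies); the example document is made up for illustration:

from collections import Counter

def unigram_model(doc_terms):
    # Each term probability is its relative frequency in the document.
    counts = Counter(doc_terms)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

Md = unigram_model(["click", "go", "the", "shears", "boys", "click", "click"])
print(Md["click"])   # 3/7: "click" is the most likely term under this model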


Some applications of LMs

• Language models are useful for tasks like speech recognition, spelling correction, and machine translation.

• In these tasks it is important to consider the probability of a term conditioned on its surrounding context in order to improve an initial recognition or translation.

– What is the best transcription, “I ate a roast beef” or “I ate a roast riff”?

How to use language models in the IR process?


Language models for IR

• The core idea is that documents can be ranked on their likelihood of generating the query.


[Diagram: a language model Mdi is estimated for each document di in the collection; given a query Q, the retrieved documents d1 … dn are ranked by P(Q|Mdi).]


Query likelihood language models

• Goal is to rank documents by P(d|Q), where the probability of a document is interpreted as the likelihood that it is relevant to the query.

What happens if a query term is not in the document?

How to solve this problem?


p(di|Q) = p(Q|di) × p(di) / p(Q),   where p(Q|di) is estimated as p(Q|Mdi)
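A minimal Python sketch of this query-likelihood ranking, assuming a uniform prior p(d) and an unsmoothed unigram document model; the toy collection is made up for illustration, and the zero score for d2 shows the problem raised in the question above:

from collections import Counter

def query_likelihood(query_terms, doc_terms):
    counts, total = Counter(doc_terms), len(doc_terms)
    score = 1.0
    for t in query_terms:
        # If a query term never occurs in the document, p(t|Md) = 0 and the
        # whole score collapses to zero (the problem raised in the question above).
        score *= counts[t] / total
    return score

docs = {"d1": ["car", "insurance", "car", "claim"], "d2": ["car", "repair"]}
query = ["car", "insurance"]
print(sorted(docs, key=lambda d: query_likelihood(query, docs[d]), reverse=True))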


Query likelihood language models (2)

• The solution is to apply a smoothing approach

• One way is to use a mixture between document-specific and collection-based distributions.
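A minimal sketch of such a mixture, written in the usual Jelinek-Mercer form (the symbols λ for the interpolation parameter and Mc for the collection model are assumptions here, chosen to match the questions below):

p(w|d) = λ · p(w|Md) + (1 − λ) · p(w|Mc)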

What controls the parameter λ?

How can we compute the distribution Mc?

What can we do/define with the prior p(d)?


IR evaluation

• Why is evaluation important?

• Which characteristics do we need to evaluate?

• How can we evaluate the performance of IR systems?

– Given several systems, which one is the best?

• What things (resources) are necessary to evaluate an IR system?

• Is IR evaluation subjective or objective?


Several perspectives

• In order to answer “How well does the system work?”, we can investigate several options:

– Processing: Time and space efficiency

– Search: Effectiveness of results

– System: Satisfaction of the user

• We will focus on evaluating retrieval effectiveness.

– How could we measure the other aspects?


Difficulties in evaluating an IR system

• Effectiveness is related to the relevancy of retrieved items.

– Relevancy is not typically binary but continuous.

– Even if relevancy is binary, it can be a difficult judgment to make.

• Relevancy, from a human standpoint, is:

– Subjective: Depends upon a specific user’s judgment.

– Situational: Relates to user’s current needs.

– Cognitive: Depends on human perception and behavior.

– Dynamic: Changes over time.


Evaluation through a user study (from Jing He's slides on IR evaluation; University of Montreal, 2012)

• Process:

– Actual users are hired

– They use the systems to complete some tasks

– They report their subjective feeling

• Strength:

– Close to real

• Weaknesses:

– Too subjective

– Too expensive, small scale, prone to bias


Cranfield Paradigm (test collection)

• It is necessary to have a test collection

– A lot of documents (the bigger the better)

– Several queries

– Relevance judgments for all queries

• Binary assessment of either relevant or non-relevant for each query-document pair.

• Methods/systems must be evaluated using the same evaluation measure.

• Because the test collection is reusable:

– Cheaper than user study

– Easy for error analysis


Constructing a test collection (from Jing He's slides on IR evaluation; University of Montreal, 2012)


Standard test collections

• TREC (Text Retrieval Conference)

– National Institute of Standards and Technology

– In total, 1.89 million documents and relevance judgments for 450 information needs

• CLEF (Cross Language Evaluation Forum)

– This evaluation series has concentrated on European languages and cross-language information retrieval

– Last Adhoc English monolingual IR task: 169,477 documents and 50 queries.

• NTCIR (NII Test Collection for IR Systems)

– East Asian languages


Retrieval effectiveness

In response to a query, an IR system searches a document collection and returns an ordered list of responses.

• Measure the quality of a set/list of responses

– a better search strategy yields a better result list

– Better result lists help the user satisfy their information need

• Two kinds of measures:

– set based and ranked-list based


Set based measures


[Diagram: within the collection, the set of relevant documents for the query and the set of retrieved documents (the results) overlap in the relevant retrieved documents.]

recall = Number of relevant documents retrieved / Total number of relevant documents

precision = Number of relevant documents retrieved / Total number of documents retrieved


Precision, recall and F-measure

• Precision (P)

– The ability to retrieve top-ranked documents that are mostly relevant.

• Recall (R)

– The ability of the search to find all of the relevant items in the corpus.

• F-measure (F)

– Harmonic mean of recall and precision


F = 2PR / (P + R)
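A minimal Python sketch of these set-based measures, with the retrieved and relevant documents given as sets of document ids (the example ids are made up for illustration):

def prf(retrieved, relevant):
    hits = len(retrieved & relevant)            # relevant documents retrieved
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0   # harmonic mean of P and R
    return p, r, f

print(prf({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7"}))   # (0.5, 0.666..., 0.571...)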


Ranked-list based measures

• P@N

– Precision at the n-th position in the ranking of results

• MRR (mean reciprocal rank)

– Reciprocal of rank of the first relevant doc

• Average Recall/Precision Curve

– Plots average precision at each standard recall level across all queries.

• MAP (mean average precision)

– Provides a single-figure measure of quality across recall levels


Some notes on P@N and MRR (https://medium.com/swlh/rank-aware-recsys-evaluation-metrics-5191bba16832)

• P@N calculates the fraction of the first n results that are good. It does not consider the result list as an ordered list.

– It treats all errors in the first n positions equally.

• MRR focuses on one single item from the list. It gives a list with a single relevant item just as much weight as a list with many relevant items.

– Good for known-item search such as navigational queries or looking for a fact.


[Example: three ranked lists (List 1, List 2, List 3) compared under P@4 (2, 2, and 1 relevant results in the top four, respectively) and under the reciprocal rank of the first relevant item (1/3, 1/2, and 1/1, respectively).]
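A minimal Python sketch of P@N and MRR, where each ranked list is a sequence of binary relevance judgments (the example lists are made up for illustration):

def precision_at(n, rels):
    # Fraction of the first n results that are relevant.
    return sum(rels[:n]) / n

def mrr(ranked_lists):
    total = 0.0
    for rels in ranked_lists:
        # Reciprocal rank of the first relevant document (0 if there is none).
        rr = next((1 / (i + 1) for i, r in enumerate(rels) if r), 0.0)
        total += rr
    return total / len(ranked_lists)

lists = [[0, 0, 1, 1], [0, 1, 1, 0], [1, 0, 0, 0]]
print([precision_at(4, l) for l in lists])   # 0.5, 0.5, 0.25
print(mrr(lists))                            # (1/3 + 1/2 + 1) / 3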


Recall/Precision Curve (from Mooney's IR course at the University of Texas at Austin)


[Figure: precision (y-axis, 0 to 1) plotted against the standard recall levels 0.1 to 1.0 (x-axis), comparing a run with stemming (Stem) against one without (NoStem).]

What is the curve of an ideal system?


MAP

• Average precision is the average of the precision scores at the rank locations of each relevant document.

• Mean Average Precision is the mean of the Average Precision scores for a group of queries.


AP = Σ(i=1..N) [ P(i) × rel(i) ] / |relevant documents|

where N is the number of retrieved documents, P(i) is the precision of the first i documents, and rel(i) is a binary function indicating whether the document at position i is relevant or not.
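A minimal Python sketch of AP and MAP following the formula above; rels is the binary relevance of each retrieved document in rank order, and n_relevant is the total number of relevant documents for the query (the example run is made up for illustration):

def average_precision(rels, n_relevant):
    hits, ap = 0, 0.0
    for i, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            ap += hits / i          # P(i) at each relevant rank position
    return ap / n_relevant if n_relevant else 0.0

def mean_average_precision(runs):
    # `runs` is a list of (rels, n_relevant) pairs, one per query.
    return sum(average_precision(r, n) for r, n in runs) / len(runs)

print(average_precision([1, 0, 1, 0, 0], n_relevant=2))   # (1/1 + 2/3) / 2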


Illustrative example (from the IR course of Northeastern University, College of Computer and Information Science)


MAP1 = 0.622

MAP2 = 0.52


Analysis of MAP (https://medium.com/swlh/rank-aware-recsys-evaluation-metrics-5191bba16832)

The AP metric represents the area under the precision-recall curve


Common problems

• Why are not all retrieved documents relevant? Why is it so difficult to reach 100% precision?

– Consider the query “jaguar”

• Why is it so complex to retrieve all relevant documents (to reach 100% recall)?

– Consider the query “religion”

• What can we do to tackle these problems?


Query expansion

• It is the process of adding terms to a user’s (weighted) query.

• Its goal is to improve precision and/or recall.

• Example:

– User Query: “car”

– Expanded Query: “car cars automobile automobiles auto” etc…

• How to do it? Ideas?


Main approaches

1. By means of a thesaurus

– Thesauri may be manually or automatically constructed.

2. By means of (user) relevance feedback

3. Automatic query expansion

– Local query expansion (blind feedback)

– Global query expansion (using word associations)


Thesaurus-based query expansion

A thesaurus provides information on synonyms and semantically related words

• Expansion procedure: For each term t in a query, expand the query with synonyms and related words of t.

• Generally increases recall.

• May significantly decrease precision, particularly with ambiguous terms.

– “interest rate” → “interest rate fascinate evaluate”
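A minimal Python sketch of this expansion procedure; the THESAURUS dictionary below is a hypothetical stand-in for a real resource such as WordNet:

# Hypothetical thesaurus entries for illustration only.
THESAURUS = {
    "car": ["cars", "automobile", "auto"],
    "rate": ["evaluate"],
    "interest": ["fascinate"],
}

def expand(query_terms):
    expanded = list(query_terms)
    for t in query_terms:
        expanded.extend(THESAURUS.get(t, []))   # add synonyms/related words of t
    return expanded

print(expand(["car"]))               # ['car', 'cars', 'automobile', 'auto']
print(expand(["interest", "rate"]))  # the ambiguous case discussed above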


Relevance feedback

• Basic procedure:

1. The user creates their initial query which returns an initial result set.

2. The user selects a list of documents that are relevant to their search.

3. The system then re-weights and/or expands the query based upon the terms in the documents

• Significant improvement in recall and precision over early query expansion work


Standard Rocchio Method

• The idea is to move the query in a direction closer to the relevant documents, and farther away from the irrelevant ones.


qm = α q + (β / |Dr|) Σ(dj ∈ Dr) dj − (γ / |Dn|) Σ(dj ∈ Dn) dj

α: Tunable weight for the initial query
β: Tunable weight for the relevant documents (the set Dr)
γ: Tunable weight for the irrelevant documents (the set Dn)
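A minimal Python sketch of this update over sparse {term: weight} vectors; the default values of alpha, beta, and gamma are typical textbook choices, not prescribed by the slide:

from collections import defaultdict

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    new_q = defaultdict(float)
    for t, w in query.items():
        new_q[t] += alpha * w
    for doc in relevant:                          # move toward relevant documents
        for t, w in doc.items():
            new_q[t] += beta * w / len(relevant)
    for doc in nonrelevant:                       # move away from irrelevant documents
        for t, w in doc.items():
            new_q[t] -= gamma * w / len(nonrelevant)
    # Negative weights are usually clipped to zero.
    return {t: w for t, w in new_q.items() if w > 0}

q = {"jaguar": 1.0}
rel = [{"jaguar": 0.5, "car": 0.8}]
nonrel = [{"jaguar": 0.4, "animal": 0.9}]
print(rocchio(q, rel, nonrel))   # the expanded query now also favours "car"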


Pseudo relevance feedback

Users do not like to give manual feedback to the system

• Use relevance feedback methods without explicit user input.

• Just assume the top m retrieved documents are relevant, and use them to reformulate the query.

• Relies largely on the system’s ability to initially retrieve relevant documents.
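A minimal Python sketch of this idea: take the terms of the top m retrieved documents and add the k most frequent ones to the query (m and k are illustrative parameters, as are the example documents):

from collections import Counter

def pseudo_feedback(query_terms, ranked_docs, m=3, k=5):
    counts = Counter()
    for doc_terms in ranked_docs[:m]:             # top-m documents assumed relevant
        counts.update(doc_terms)
    # Add the k most frequent new terms to the original query.
    expansion = [t for t, _ in counts.most_common() if t not in query_terms][:k]
    return list(query_terms) + expansion

docs = [["car", "insurance", "claim"], ["car", "insurance", "policy"], ["car", "repair"]]
print(pseudo_feedback(["car"], docs, m=2, k=3))   # ['car', 'insurance', 'claim', 'policy']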


Automatic global analysis

• Determine term similarity through a pre-computed statistical analysis of the complete corpus.

– Compute association matrices which quantify term correlations in terms of how frequently they co-occur.

• Expand queries with statistically most similar terms.

– The same information is used for all queries; it is an offline process.
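A minimal Python sketch of such an offline analysis: build a term co-occurrence (association) matrix from the corpus and use it to find the terms most associated with a query term (the toy corpus is made up for illustration):

from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    # Count how often each pair of terms appears in the same document.
    co = Counter()
    for terms in docs:
        for a, b in combinations(sorted(set(terms)), 2):
            co[(a, b)] += 1
    return co

def top_associates(term, co, k=3):
    scores = Counter()
    for (a, b), c in co.items():
        if a == term:
            scores[b] += c
        elif b == term:
            scores[a] += c
    return [t for t, _ in scores.most_common(k)]

docs = [["car", "insurance", "claim"], ["car", "repair", "garage"], ["car", "insurance"]]
co = cooccurrence(docs)
print(top_associates("car", co))   # terms most often co-occurring with "car"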


Clustering in information retrieval

• Cluster hypothesis: Documents in the same cluster behave similarly with respect to relevance to information needs.

– If there is a document from a cluster that is relevant to a search request, then it is likely that other documents from the same cluster are also relevant.

• Two main uses:

– Collection clustering

• Higher efficiency: faster search

• Tends to improve recall

– Search-results clustering

• More effective presentation of information to the user
