Web Search - Summer Term 2006
II. Information Retrieval (Models, Cont.)
(c) Wolfgang Hürst, Albert-Ludwigs-University


Page 1:

Web Search - Summer Term 2006

II. Information Retrieval (Models, Cont.)

(c) Wolfgang Hürst, Albert-Ludwigs-University

Page 2:

Classic Retrieval Models

1. Boolean Model (set theoretic)

2. Vector Model (algebraic)

3. Probabilistic Models (probabilistic)

Page 3:

Probabilistic IR Models

Based on probability theory.

Basic idea: given a document d and a query q, estimate the likelihood of d being relevant for the information need represented by q, i.e. P(R|q,d).

Compared to previous models:
- Boolean and Vector Models: ranking based on a relevance value, which is interpreted as a similarity measure between q and d
- Probabilistic Models: ranking based on the estimated likelihood of d being relevant for query q

Page 4:

Probabilistic Modeling

Given: documents dj = (t1, t2, ..., tn) and queries qi (n = number of index terms).

We assume a similar dependence between d and q as before, i.e. relevance depends on the term distribution. (Note: slightly different notation here than before!)

Estimating P(R|d,q) directly is often impossible in practice. Instead: use Bayes' theorem, i.e.

P(R|d,q) = P(d|R,q) · P(R|q) / P(d|q)

or, equivalently, the odds form

P(R|d,q) / P(NR|d,q) = [P(d|R,q) / P(d|NR,q)] · [P(R|q) / P(NR|q)]
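As a quick numerical sanity check of the Bayes inversion above, a minimal sketch; all probability values here are made up for illustration:

```python
# Toy check of Bayes' theorem for P(R|d,q):
# P(R|d,q) = P(d|R,q) * P(R|q) / P(d|q), with the denominator expanded by
# the law of total probability over the classes R (relevant) / NR (not relevant).
p_r = 0.1             # prior P(R|q) -- assumed, made up
p_d_given_r = 0.4     # likelihood P(d|R,q) -- made up
p_d_given_nr = 0.05   # likelihood P(d|NR,q) -- made up

# P(d|q) = P(d|R,q)*P(R|q) + P(d|NR,q)*P(NR|q)
p_d = p_d_given_r * p_r + p_d_given_nr * (1 - p_r)
p_r_given_d = p_d_given_r * p_r / p_d   # Bayes' theorem

print(round(p_r_given_d, 4))
```

Even with a small prior, a document that is much more likely under R than under NR ends up with a substantial relevance probability.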

Page 5:

Probab. Modeling as Decision Strategy

The decision which docs should be returned is based on a threshold calculated with a cost function Cj.

Example cost function Cj(R, dec):

                  Retrieved   Not retrieved
Relevant doc.         0             1
Non-rel. doc.         2             0

The decision is based on a risk function that minimizes the expected cost: retrieve dj if the expected cost of retrieving it is at most the expected cost of not retrieving it, i.e. (with the example costs) if

2 · P(NR|q,dj) ≤ 1 · P(R|q,dj)
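The decision rule can be sketched in code. This is a toy version using the example cost table; the relevance probability is assumed to be given by the model:

```python
# Expected-cost retrieval decision (sketch; costs from the example table).
# cost[relevance][decision]
cost = {
    "rel":    {"retrieved": 0, "not_retrieved": 1},
    "nonrel": {"retrieved": 2, "not_retrieved": 0},
}

def should_retrieve(p_rel: float) -> bool:
    """Retrieve iff expected cost of retrieving <= expected cost of skipping."""
    p_nonrel = 1.0 - p_rel
    ec_retrieve = (cost["rel"]["retrieved"] * p_rel
                   + cost["nonrel"]["retrieved"] * p_nonrel)
    ec_skip = (cost["rel"]["not_retrieved"] * p_rel
               + cost["nonrel"]["not_retrieved"] * p_nonrel)
    return ec_retrieve <= ec_skip

print(should_retrieve(0.8))  # high P(R|q,d) -> True
print(should_retrieve(0.3))  # low P(R|q,d)  -> False
```

With these costs the rule reduces to retrieving whenever P(R|q,d) ≥ 2/3, i.e. the cost ratio directly sets the retrieval threshold.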

Page 6:

Probability Estimation

Different approaches to estimate P(d|R) exist:
- Binary Independence Retrieval Model (BIR)
- Binary Independence Indexing Model (BII)
- Darmstadt Indexing Approach (DIA)

Generally, we assume stochastic independence between the terms of one document, i.e.

P(dj|R) = P(t1|R) · P(t2|R) · ... · P(tn|R)
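The independence assumption can be illustrated with a toy computation; the per-term probabilities below are made up:

```python
# Sketch of the term-independence assumption: P(d|R) as a product of
# per-term probabilities P(ti|R). The values here are invented examples.
from functools import reduce

p_term_given_rel = {"web": 0.6, "search": 0.5, "ranking": 0.3}  # made up

def p_doc_given_rel(doc_terms):
    """P(d|R) = prod_i P(ti|R), assuming stochastic term independence."""
    probs = [p_term_given_rel[t] for t in doc_terms]
    return reduce(lambda a, b: a * b, probs, 1.0)

print(p_doc_given_rel(["web", "search"]))  # 0.6 * 0.5
```

In practice the product is usually computed as a sum of logs to avoid underflow for long documents.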

Page 7:

Binary Independence Retr. Model (BIR)

Learning: estimation of the probability distribution based on
- a query qk
- a set of documents dj
- respective relevance judgments

Application: generalization to different documents from the collection (but restricted to the same query and the terms from training).

(Diagram: a DOCS × TERMS × QUERIES space; BIR learns from the documents and terms of one query and is applied to further documents of the collection.)

Page 8:

Binary Indep. Indexing Model (BII)

Learning: estimation of the probability distribution based on
- a document dj
- a set of queries qk
- respective relevance judgments

Application: generalization to different queries (but restricted to the same document and the terms from training).

(Diagram: a DOCS × TERMS × QUERIES space; BII learns from the queries and terms of one document and is applied to further queries.)

Page 9:

Darmstadt Indexing Approach (DIA)

Learning: estimation of the probability distribution based on
- a set of queries qk
- an abstract description of a set of documents dj
- respective relevance judgments

Application: generalization to different queries and documents.

(Diagram: a DOCS × TERMS × QUERIES space; DIA learns over queries and abstract document descriptions and is applied to new queries and documents.)

Page 10:

DIA - Description Step

Basic idea: instead of term-document pairs, consider relevance descriptions x(ti, dm).

These contain the values of certain attributes of term ti, document dm and their relation to each other.

Examples:
- Dictionary information about ti (e.g. IDF)
- Parameters describing dm (e.g. length or number of unique terms)
- Information about the appearance of ti in dm (e.g. in title or abstract), its frequency, the distance between two query terms, etc.

REFERENCE: FUHR, BUCKLEY [4]

Page 11:

DIA - Decision Step

Estimation of the probability P(R | x(ti, dm)).

P(R | x(ti, dm)) is the probability of a document dm being relevant to an arbitrary query, given that a term common to both document and query has the relevance description x(ti, dm).

Advantages:
- Abstraction from specific term-document pairs and thus generalization to arbitrary docs and queries
- Enables individual, application-specific relevance descriptions

Page 12:

DIA - (Very) Simple Example

RELEVANCE DESCRIPTION: x(ti, dm) = (x1, x2) with

x1 = 1, if ti occurs in the title of dm; 0, otherwise
x2 = 1, if ti occurs in dm once; 2, if ti occurs in dm at least twice

TRAINING SET: q1, q2, d1, d2, d3

EVENT SPACE:

QUERY  DOC.  REL.       TERM  x
q1     d1    rel.       t1    (1,1)
                        t2    (0,1)
                        t3    (1,2)
q1     d2    not rel.   t1    (0,2)
                        t3    (1,1)
                        t4    (0,1)
q2     d1    rel.       t2    (0,2)
                        t5    (0,2)
                        t6    (1,1)
                        t7    (1,2)
q2     d3    not rel.   t5    (0,1)
                        t7    (0,1)

Resulting estimates (relative frequency of relevant events per x):

x      E(x)
(0,1)  1/4
(0,2)  2/3
(1,1)  2/3
(1,2)  1
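The E(x) column can be reproduced by simple counting over the training events. A minimal sketch in Python; the event list is transcribed from the example above:

```python
# Estimate E(x) = P(R | x) as the relative frequency of relevant
# (query, doc, term) events among all events with relevance description x.
from collections import defaultdict
from fractions import Fraction

# (query, doc, relevant?, term, x = (x1, x2)) -- from the slide's event space
events = [
    ("q1", "d1", True,  "t1", (1, 1)),
    ("q1", "d1", True,  "t2", (0, 1)),
    ("q1", "d1", True,  "t3", (1, 2)),
    ("q1", "d2", False, "t1", (0, 2)),
    ("q1", "d2", False, "t3", (1, 1)),
    ("q1", "d2", False, "t4", (0, 1)),
    ("q2", "d1", True,  "t2", (0, 2)),
    ("q2", "d1", True,  "t5", (0, 2)),
    ("q2", "d1", True,  "t6", (1, 1)),
    ("q2", "d1", True,  "t7", (1, 2)),
    ("q2", "d3", False, "t5", (0, 1)),
    ("q2", "d3", False, "t7", (0, 1)),
]

counts = defaultdict(lambda: [0, 0])   # x -> [relevant count, total count]
for _, _, rel, _, x in events:
    counts[x][0] += int(rel)
    counts[x][1] += 1

estimates = {x: Fraction(r, n) for x, (r, n) in counts.items()}
for x in sorted(estimates):
    print(x, estimates[x])
```

Running this reproduces the table: 1/4, 2/3, 2/3 and 1 for the four descriptions.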

Page 13:

DIA - Indexing Function

Because of the relevance descriptions: generalization to arbitrary docs and queries.

Another advantage: instead of probabilities, we can also use a general indexing function e(x(ti, dm)).

Note: this is a typical pattern recognition problem, i.e.
- Given: a set of features/parameters and different classes (here: relevant and not relevant)
- Goal: classification based on these features
Approaches such as neural networks, SVMs, etc. can be used.

Page 14:

Models for IR - Taxonomy

Classic models:
- Boolean model (based on set theory); extensions: fuzzy set model, extended Boolean model
- Vector space model (based on algebra); extensions: generalized vector model, latent semantic indexing, neural networks
- Probabilistic models (based on probability theory); extensions: inference networks, belief networks

Further models: structured models, models for browsing, filtering

SOURCE: R. BAEZA-YATES [1], PAGES 20-21

Page 15:

References & Recommended Reading

[1] R. BAEZA-YATES, B. RIBEIRO-NETO: MODERN INFORMATION RETRIEVAL, ADDISON WESLEY, 1999. CHAPTERS 2-2.5 (IR MODELS), CH. 5 (RELEVANCE FEEDBACK)

[2] N. FUHR: SKRIPTUM ZUR VORLESUNG INFORMATION RETRIEVAL (LECTURE NOTES), AVAILABLE ONLINE AT THE COURSE HOME PAGE http://www.is.informatik.uni-duisburg.de/courses/ir_ss06/index.html OR DIRECTLY AT http://www.is.informatik.uni-duisburg.de/courses/ir_ss06/folien/irskall.pdf. CHAPTERS 5.1-5.3, 5.5, 6 (IR MODELS)

[3] F. CRESTANI, M. LALMAS, C.J. VAN RIJSBERGEN, I. CAMPBELL: IS THIS DOCUMENT RELEVANT? ... PROBABLY: A SURVEY OF PROBABILISTIC MODELS IN INFORMATION RETRIEVAL, ACM COMPUTING SURVEYS, VOL. 30, NO. 4, DEC. 1998. CHAPTERS 1-3.4 (PROBABILISTIC MODELS)

[4] N. FUHR, C. BUCKLEY: A PROBABILISTIC LEARNING APPROACH FOR DOCUMENT INDEXING, ACM TRANSACTIONS ON INFORMATION SYSTEMS, VOL. 9, NO. 3, JULY 1991. CHAPTERS 2 AND 4 (PROBABILISTIC MODELS)

Page 16:

Web Search - Summer Term 2006

II. Information Retrieval (Basics: Relevance Feedback)

(c) Wolfgang Hürst, Albert-Ludwigs-University

Page 17:

Relevance Feedback

Motivation: formulating a good query is often difficult.

Idea: improve the search result by indicating the relevance of the initially returned docs.

Possible usage:
- Get better search results
- Re-train the current IR model

Different approaches based on:
- User feedback
- Local information in the initial result set
- Global information in the whole document collection

Page 18:

Relev. Feedb. Based on User Input

Procedure:
- User enters initial query
- System returns a result based on this query
- User marks relevant documents
- System selects important terms from the marked docs
- System returns a new result based on these terms

Two approaches:
- Query expansion
- Term re-weighting

Advantages:
- Breaks the search task down into smaller steps
- Relevance judgments are easier to make than the (re-)formulation of a query
- Controlled process to emphasize relevant terms and de-emphasize irrelevant ones

Page 19:

Query Expansion & Term Re-Weighting for the Vector Model

Vector Space Model: Representation of documents and queries as weighted vectors of terms

Assumption:
- Large overlap of term sets from relevant documents
- Small overlap of term sets from irrelevant docs

Basic idea: re-formulate the query in order to move the query vector closer to the documents marked as relevant.
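As a minimal sketch of the vector space similarity this assumption relies on (all term weights here are invented):

```python
# Cosine similarity between a weighted query vector and document vectors,
# as used by the vector space model. The weights are made-up examples.
import math

def cosine(a, b):
    """Cosine of the angle between two equally sized weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

q  = [1.0, 1.0, 0.0]
d1 = [0.9, 0.8, 0.1]   # large term overlap with q -> high similarity
d2 = [0.0, 0.1, 0.9]   # small overlap -> low similarity
print(cosine(q, d1) > cosine(q, d2))  # True
```

Moving q towards the relevant documents therefore directly raises their cosine scores in the next retrieval round.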

Page 20:

Optimum Query Vector

Dr = set of returned docs marked as relevant
Dn = set of returned docs marked as irrelevant
Cr = set of all relevant docs in the doc. collection
|Dr|, |Dn|, |Cr| = number of docs in the respective doc. sets (N = size of the collection)
α, β, γ = constant factors (for fine-tuning)

Best query vector to distinguish relevant from non-relevant docs:

qopt = (1/|Cr|) · Σ_{dj ∈ Cr} dj − (1/(N − |Cr|)) · Σ_{dj ∉ Cr} dj

Page 21:

Query Expansion & Term Re-Weighting

Based on the relevance feedback from the user, we incrementally change the initial query vector q to create a better query vector qm.

Goal: approximation of the optimum query vector qopt.

Standard Rocchio approach:

qm = α·q + (β/|Dr|) · Σ_{dj ∈ Dr} dj − (γ/|Dn|) · Σ_{dj ∈ Dn} dj

Other approaches exist, e.g. Ide Regular and Ide Dec-Hi.
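The Rocchio update can be sketched as follows. This is a toy implementation; the vectors and the constants α=1, β=0.75, γ=0.15 are made-up example values, not from the slides:

```python
# Minimal sketch of the standard Rocchio query update.

def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return q_m = alpha*q + beta*centroid(rel) - gamma*centroid(nonrel)."""
    def centroid(docs):
        if not docs:
            return [0.0] * len(q)
        return [sum(d[i] for d in docs) / len(docs) for i in range(len(q))]
    c_rel = centroid(relevant)
    c_non = centroid(nonrelevant)
    return [alpha * qi + beta * r - gamma * n
            for qi, r, n in zip(q, c_rel, c_non)]

q   = [1.0, 0.0, 1.0, 0.0]                            # initial query (example)
rel = [[1.0, 1.0, 0.0, 0.0], [1.0, 0.5, 0.0, 0.0]]    # docs marked relevant
non = [[0.0, 0.0, 0.0, 1.0]]                          # doc marked irrelevant
print(rocchio(q, rel, non))
```

Terms of the relevant docs gain weight (including term 2, absent from q, which is the query-expansion effect), while the irrelevant doc's term is pushed negative.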

Page 22:

Relev. Feedb. without User Input

Different approaches based on:
- User feedback
- Local information in the initial result set
- Global information in the whole document collection

Basic idea of relevance feedback: clustering, i.e. the docs marked as relevant contain additional terms which describe a larger cluster of relevant docs.

So far: get user feedback to create this term set.
Now: approaches to get these term sets automatically.

Two approaches:
- Local strategies (based on the returned result set)
- Global strategies (based on the whole document collection)

Page 23:

Query Exp. Through Local Clustering

Motivation: Given a query q, there exists a local relationship between relevant documents

Basic idea: Expand query q with additional terms based on a clustering of the documents from the initial result set

Different approaches exist:
- Association clusters: assume a correlation between terms co-occurring in different docs
- Metric clusters: assume a correlation between terms close to each other (in a document)
- Scalar clusters: assume a correlation between terms with a similar neighborhood

Page 24:

Metric Clusters

Note: in the following we consider word stems s instead of terms (analogous to the literature; works similarly with terms).

r(ti, tj) = distance between two terms ti and tj in document d (in number of terms)
s(t) = root (stem) of term t
V(s) = set of all words with root s

Define a local stem-stem correlation matrix S with elements su,v based on the correlation

cu,v = Σ_{ti ∈ V(su)} Σ_{tj ∈ V(sv)} 1 / r(ti, tj)

or normalized:

su,v = cu,v / (|V(su)| · |V(sv)|)
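A minimal sketch of the non-normalized correlation cu,v for a single document; the stems and positions are made-up example data:

```python
# Metric (distance-based) stem-stem correlation from term positions
# in one document: close co-occurrences contribute more than distant ones.
from collections import defaultdict

# document as a list of (stem, position); positions counted in terms
doc = [("comput", 0), ("retriev", 1), ("model", 4), ("comput", 5)]

def metric_correlation(doc, su, sv):
    """c_{u,v} = sum over occurrences ti of su, tj of sv of 1/r(ti, tj)."""
    pos = defaultdict(list)
    for stem, p in doc:
        pos[stem].append(p)
    return sum(1.0 / abs(pi - pj)
               for pi in pos[su] for pj in pos[sv] if pi != pj)

c = metric_correlation(doc, "comput", "retriev")
print(c)  # 1/|0-1| + 1/|5-1| = 1.25
```

Note how the adjacent pair contributes 1.0 while the pair four terms apart adds only 0.25: proximity dominates the correlation.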

Page 25:

Query Exp. With Metric Clusters

Clusters based on metric correlation matrices: for a given stem su, return the n stems (roots) sv with the highest values su,v.

Use these clusters for query expansion.

Comments:
- Clusters do not necessarily contain synonyms
- Non-normalized clusters often contain high-frequency terms
- Normalized clusters often group terms that appear less often
- Therefore: combined approaches exist (i.e. using normalized and non-normalized clusters)

Page 26:

Overview of Approaches

Based on user feedback

Based on local information in the initial result set:
- Local clustering
- Local context analysis (combines local and global information)

Based on global information in the whole document collection, examples:
- Query expansion using a similarity thesaurus
- Query expansion using a statistical thesaurus

Page 27:

References (Books)

R. BAEZA-YATES, B. RIBEIRO-NETO: MODERN INFORMATION RETRIEVAL, ADDISON WESLEY, 1999

WILLIAM B. FRAKES, RICARDO BAEZA-YATES (EDS.): INFORMATION RETRIEVAL – DATA STRUCTURES AND ALGORITHMS, P T R PRENTICE HALL, 1992

C. J. VAN RIJSBERGEN: INFORMATION RETRIEVAL, 1979, http://www.dcs.gla.ac.uk/Keith/Preface.html

C. MANNING, P. RAGHAVAN, H. SCHÜTZE: INTRODUCTION TO INFORMATION RETRIEVAL (TO APPEAR 2007) http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html

I. WITTEN, A. MOFFAT, T. BELL: MANAGING GIGABYTES, MORGAN KAUFMANN PUBLISHING, 1999

N. FUHR: SKRIPTUM ZUR VORLESUNG INFORMATION RETRIEVAL, SS 2006

AND MANY MORE!

Page 28:

References (Articles)

G. SALTON: A BLUEPRINT FOR AUTOMATIC INDEXING, ACM SIGIR FORUM, VOL. 16, ISSUE 2, FALL 1981

F. CRESTANI, M. LALMAS, C.J. VAN RIJSBERGEN, I. CAMPBELL: IS THIS DOCUMENT RELEVANT? ... PROBABLY: A SURVEY OF PROBABILISTIC MODELS IN INFORMATION RETRIEVAL, ACM COMPUTING SURVEYS, VOL. 30, NO. 4, DEC. 1998

N. FUHR, C. BUCKLEY: A PROBABILISTIC LEARNING APPROACH FOR DOCUMENT INDEXING, ACM TRANSACTIONS ON INFORMATION SYSTEMS, VOL. 9, NO. 3, JULY 1991

Further Sources

IR-RELATED CONFERENCES:
- ACM SIGIR International Conference on Information Retrieval
- ACM / IEEE Joint Conference on Digital Libraries (JCDL)
- ACM Conference on Information and Knowledge Management (CIKM)
- Text REtrieval Conference (TREC), http://trec.nist.gov

Page 29:

Recap: IR System & Tasks Involved

(Diagram: the user's INFORMATION NEED enters via the User Interface as a QUERY; query processing (parsing & term processing) produces the logical view of the information need. On the document side, data is selected for indexing and runs through parsing & term processing into the INDEX. SEARCHING and RANKING over the index yield the RESULTS, shown via the result representation; PERFORMANCE EVALUATION spans the whole process.)

Page 30:

Schedule

- Introduction
- IR-Basics (Lectures): overview, terms and definitions; index (inverted files); term processing; query processing; ranking (TF*IDF, ...); evaluation; IR-Models (Boolean, vector, probabilistic)
- IR-Basics (Exercises)
- Web Search (Lectures and exercises)

Page 31:

Organizational Remarks

Exercises: please register for the exercises by sending me ([email protected]) an email containing
- your name,
- matriculation number (Matrikelnummer),
- degree program (BA, MSc, Diploma, ...),
- plans for the exam (yes, no, undecided).

This is just to organize the exercises, i.e. there are no consequences if you decide to drop this course.

Registration should be done before the exercises start. Later registration might be possible under certain circumstances (contact me).