Discriminative Models for Information Retrieval Ramesh Nallapati UMass SIGIR 2004

Discriminative Models for Information Retrieval

Ramesh NallapatiUMass

SIGIR 2004

Abstract Discriminative model vs. Generative model

Discriminative – attractive theoretical properties Performance Comparison

Discriminative – Maximum Entropy, Support Vector Machine

Generative – Language Modeling Experiment

Ad-hoc Retrieval ME is worse than LM, SVM are on par with LM

Home-Page Finding Prefer SVM over LM

Introduction Traditional IR

A problem of measuring the similarity between docs and query, such as Vector Space Model

Shortcoming Term-weights – empirically tuned No theoretical basis for computing optimum

weights Binary independence retrieval (BIR)

Robertson and Sparck Jones (1976) First model that viewed IR as a classification problem This allows us to leverage many sophisticated

techniques developed in ML domanin Discriminative models

Good success in many applications of ML

Discriminative and Generative Classifiers

Pattern Classification The problem of classifying an example based on its

vector of features x into its class C through a posterior probability P(C|x) or simply a confidence score g(C|x)

Discriminative models Model the posterior directly or learn a direct map

from inputs x to the class labels C Generative models

Model the class-conditional probability P(x|C) and the prior probability P(C) and estimate the posterior through the Bayes’ rule( | ) ( | ) ( )P C x P x C P C

Probabilistic IR models as Classifiers (1/3)

Binary Independence Retrieval (BIR) model Ranking is done by the log-likelihood ration of relevance

: 1 : 0

( | ) ( | ) ( )log( ) log( )

( | ) ( | ) ( )

( 1| ) ( 0 | ) ( )log( )

( 1| ) ( 0 | ) ( )i i

i i

i x i xi i

P R D P D R P R

P R D P D R P R

P x R P x R P R

P x R P x R P R

Ranking is done by the log-likelihood ration of relevance The model has not met with good empirical success owing

to the difficulty in estimating the class conditional P(xi=1|R)

Assume uniform probability distribution over the entire vocabulary and update the probabilities as relevant docs are provided by the user


Two-Poisson model Follow the same framework as that of BIR model, but

they use a mixture of two Poisson distributions to model the class conditions and( | )P D R ( | )P D R

1

1

( | ) ( ( ( ) | ) ( | ) ( ( ) | ) ( | ))

exp( ) exp( ) ( (1 ) )

! !

n

i ii

k kn

i

P D R P c w k E P E R P c w k E P E R

l l m mp p

k k

This also is a generative model Similar to the BIR model, it also needs relevance

feedback for accurate parameter estimation


Language Models Ponte and Croft (1998) The ranking of a doc is given by the probability of

generation of the query from doc’s language model

1 21

( | ) ( , ,.., | ) ( | )n

n ii

P Q D P q q q D P q D

This model circumvent the problem of estimating the

model of relevant documents that the BIR model and Two-Poisson suffer from

LM can be considered generative classifiers in a multi-class classification sense

The Case for Discriminative Models for IR (1/3)

Discriminative vs. Generative One should solve the problem (classification) directly

and never solve a more general problem (class-conditional) as an intermediate step

Model Assumptions GM

Terms are conditionally independent LM assume docs obey a multinomial distribution of

terms DM

Typically make few assumptions and in a sense, let the data speak for itself.


Expressiveness GM - LM are not expressive enough to incorporate

many features into the model DM - It can include all features effortlessly into a

single model Learning arbitrary features

In view of the many query dependent and query-independent doc features and user-preferences that influence features, we believe that a DM that learns all the features is best suited for the generalized IR problem


Notion of Relevance In LM, there is no explicit notion of relevance. There

has been considerable controversy on the missing relevance variable in LM.

We believe that Robertson’s view of IR as a binary classification problem of relevance is more realistic than the implicit notion of relevance as it exists in LM.

Discriminative Models Used in Current Work (1/2)

Maximum Entropy Model The principle of ME – model all that is known and

assume nothing about that which is unknown The parametric form of the ME probability function

can be expressed by

,1

1( | , ) exp( ( , ))

( , )

n

i R ii

P R D Q f D QZ Q D

The feature weights (λ) are learned from training data

using a fast gradient descent algorithm As in Robertson’s BIR model, we use the log-likelihood ratio

as the scoring function for ranking as shown follows

, ,1

( | , )log ( ) ( , )

( | , )

n

i R ii Ri

P R D Qf D Q

P R D Q

Discriminative Models Used in Current Work (2/2)

Support Vector Machines (SVM) Basic idea – find the hyper-plane that separate the two

classes of training examples with the largest margin If f(D,Q) is the vector of features, then the discriminative

function is given by

( | , ) ( ( , ))g R D Q w f D Q b

The SVM is trained such that g(R|D,Q)>=1 for positive (relevant) examples and g(R|D,Q) <= -1 for negative (non-relevant) examples as long as the data is separable

Both DMs Retaining the basic framework of the BIR model, while

avoid estimating the class-conditional and instead directly compute the posterior P(R|Q,D) or the mapping function g(R|R,Q)

Other Modeling Issues

Out of Vocabulary words (OOV) problem Test queries are almost always guaranteed to contain

words that are not seen in the training queries The features are not based on words themselves, but

on query-based statistics of documents such as the total frequency of occurrences or the sum-total of the idf-values of the query terms

Unbalanced data The classes (non-relevant) is a large portion of all the

examples, while the other (relevant) class have only a small percent of of the examples

Over-sampling the minority class Under-sampling the majority class

Experiments and Results (1/8) Ad-hoc retrieval

Data set

Preprocessing K-stemmer and removing stop-words

Only use title queries for retrieval LM

Training the LM consists of learning the optimal value of the smoothing parameter

All LM runs were performed using Lemur

Experiments and Results (2/8) DM

Features

SVM svm-light for SVM runs Linear kernel gives the best performance on

most data set (converge rapidly) ME

The toolkit of Zhang

Experiments and Results (3/8) The comparison of performance of LM, SVM and ME

50% (8/16) is indistinguishable; 12.5% (2/16) is that SVM is statistically better than LM; 37.5 (6/16) is that LM is superior to SVM

Experiments and Results (4/8) Discussion

Official TREC runs – query expansion DM can improve the performance by including other

features such as proximity of query terms, occurrence of query terms as noun-phrases, etc. Such features would not be easy to incorporate into the LM framework

In the emergence of modern IR collections such as web and scientific literature that are characterized by a diverse variety of features, we will increasingly rely on models that can automatically learn these features from examples

Experiments and Results (5/8) Home-page finding on web collection

Choose the home-page finding task of TREC-10 where many features such as title, anchor-text and link structure influence relevance

Example – returned the web-page http://trec.nist.gov where the query “Text Retrieval Conference” is issued

Corpus – WT10G Queries

50 for training, 50 for development and 145 for testing

Evaluation Metrics The mean reciprocal rank (MRR) Success rate – an answer is found in top 10 Failure rate – no answer is returned in top 100

Experiments and Results (6/8) Three Indexes

A content index consisting of the textual content of the documents with all the HTML tags removed

An index of the anchor text documents An index of the titles of all documents

20 Features Use 6 previous features from each of the three

indexes Two additional features

URL-depth – a home-page typically is at depth 1 Link-Factornum-links( )

log(1 )Avg-num-links

D

Experiments and Results (7/8) Performance on development set

Performance on test set

Experiments and Results (8/8) Discussion

SVMs leverage a variety of features and improve on the baseline LM performance by 48.6% in MRR

The best run in TREC-10 achieved an MRR of 0.77 on the test set; however, their feature weights were optimized using empirical means while our models learn them automatically.

Only demonstrate the learning ability of SVMs We believe there is a lot more that needs to be in

defining the right kind of features, such as PageRank for the link-factor feature.

Related Works A few attempts in applying discriminative models for IR Cooper and Huizinga make a strong case for applying

the maximum entropy approach to the problems of information retrieval

Kantor and Lee extend the analysis of the principle of maximum entropy in the context of information retrieval

Greiff and Ponte showed that the classic binary independence model and the maximum entropy approach are equivalent

Gey suggested the method of logistic regression, which is equivalent to the method of maximum entropy used in our work

Conclusion and Future Work (1/2)

Treat IR as a problem of binary classification Quantify relevance explicitly Permit us apply sophisticated pattern classification

technique Explore SVMs and MEs to IR

Their main utility to IR lies in their ability to learn automatically a variety of features that influence relevance

Ad-hoc retrieval SVMs perform as well as LMs

Home-page finding SVMs outperform the baseline runs by about 50% in

MRR

Conclusion and Future Work (2/2)

Future Work Further improvement through better feature

engineering and by leveraging a huge body of literature on SMVs and other learning algorithms

Evaluate the performance of SVMs on ad-hoc retrieval task with longer queries

Enhance features such as proximity of query terms, synonyms, etc.

Study user modeling by incorporating user-preferences as features in the SVMs

Documents

Discriminative Models for Information Retrieval Ramesh Nallapati UMass SIGIR 2004