Upload
ralph-rich
View
224
Download
4
Embed Size (px)
Citation preview
Discriminative Models for Information Retrieval
Ramesh NallapatiUMass
SIGIR 2004
Abstract Discriminative model vs. Generative model
Discriminative – attractive theoretical properties Performance Comparison
Discriminative – Maximum Entropy, Support Vector Machine
Generative – Language Modeling Experiment
Ad-hoc Retrieval ME is worse than LM, SVM are on par with LM
Home-Page Finding Prefer SVM over LM
Introduction Traditional IR
A problem of measuring the similarity between docs and query, such as Vector Space Model
Shortcoming Term-weights – empirically tuned No theoretical basis for computing optimum
weights Binary independence retrieval (BIR)
Robertson and Sparck Jones (1976) First model that viewed IR as a classification problem This allows us to leverage many sophisticated
techniques developed in ML domanin Discriminative models
Good success in many applications of ML
Discriminative and Generative Classifiers
Pattern Classification The problem of classifying an example based on its
vector of features x into its class C through a posterior probability P(C|x) or simply a confidence score g(C|x)
Discriminative models Model the posterior directly or learn a direct map
from inputs x to the class labels C Generative models
Model the class-conditional probability P(x|C) and the prior probability P(C) and estimate the posterior through the Bayes’ rule( | ) ( | ) ( )P C x P x C P C
Probabilistic IR models as Classifiers (1/3)
Binary Independence Retrieval (BIR) model Ranking is done by the log-likelihood ration of relevance
: 1 : 0
( | ) ( | ) ( )log( ) log( )
( | ) ( | ) ( )
( 1| ) ( 0 | ) ( )log( )
( 1| ) ( 0 | ) ( )i i
i i
i x i xi i
P R D P D R P R
P R D P D R P R
P x R P x R P R
P x R P x R P R
Ranking is done by the log-likelihood ration of relevance The model has not met with good empirical success owing
to the difficulty in estimating the class conditional P(xi=1|R)
Assume uniform probability distribution over the entire vocabulary and update the probabilities as relevant docs are provided by the user
Probabilistic IR models as Classifiers (2/3)
Two-Poisson model Follow the same framework as that of BIR model, but
they use a mixture of two Poisson distributions to model the class conditions and( | )P D R ( | )P D R
1
1
( | ) ( ( ( ) | ) ( | ) ( ( ) | ) ( | ))
exp( ) exp( ) ( (1 ) )
! !
n
i ii
k kn
i
P D R P c w k E P E R P c w k E P E R
l l m mp p
k k
This also is a generative model Similar to the BIR model, it also needs relevance
feedback for accurate parameter estimation
Probabilistic IR models as Classifiers (3/3)
Language Models Ponte and Croft (1998) The ranking of a doc is given by the probability of
generation of the query from doc’s language model
1 21
( | ) ( , ,.., | ) ( | )n
n ii
P Q D P q q q D P q D
This model circumvent the problem of estimating the
model of relevant documents that the BIR model and Two-Poisson suffer from
LM can be considered generative classifiers in a multi-class classification sense
The Case for Discriminative Models for IR (1/3)
Discriminative vs. Generative One should solve the problem (classification) directly
and never solve a more general problem (class-conditional) as an intermediate step
Model Assumptions GM
Terms are conditionally independent LM assume docs obey a multinomial distribution of
terms DM
Typically make few assumptions and in a sense, let the data speak for itself.
The Case for Discriminative Models for IR (2/3)
Expressiveness GM - LM are not expressive enough to incorporate
many features into the model DM - It can include all features effortlessly into a
single model Learning arbitrary features
In view of the many query dependent and query-independent doc features and user-preferences that influence features, we believe that a DM that learns all the features is best suited for the generalized IR problem
The Case for Discriminative Models for IR (3/3)
Notion of Relevance In LM, there is no explicit notion of relevance. There
has been considerable controversy on the missing relevance variable in LM.
We believe that Robertson’s view of IR as a binary classification problem of relevance is more realistic than the implicit notion of relevance as it exists in LM.
Discriminative Models Used in Current Work (1/2)
Maximum Entropy Model The principle of ME – model all that is known and
assume nothing about that which is unknown The parametric form of the ME probability function
can be expressed by
,1
1( | , ) exp( ( , ))
( , )
n
i R ii
P R D Q f D QZ Q D
The feature weights (λ) are learned from training data
using a fast gradient descent algorithm As in Robertson’s BIR model, we use the log-likelihood ratio
as the scoring function for ranking as shown follows
, ,1
( | , )log ( ) ( , )
( | , )
n
i R ii Ri
P R D Qf D Q
P R D Q
Discriminative Models Used in Current Work (2/2)
Support Vector Machines (SVM) Basic idea – find the hyper-plane that separate the two
classes of training examples with the largest margin If f(D,Q) is the vector of features, then the discriminative
function is given by
( | , ) ( ( , ))g R D Q w f D Q b
The SVM is trained such that g(R|D,Q)>=1 for positive (relevant) examples and g(R|D,Q) <= -1 for negative (non-relevant) examples as long as the data is separable
Both DMs Retaining the basic framework of the BIR model, while
avoid estimating the class-conditional and instead directly compute the posterior P(R|Q,D) or the mapping function g(R|R,Q)
Other Modeling Issues
Out of Vocabulary words (OOV) problem Test queries are almost always guaranteed to contain
words that are not seen in the training queries The features are not based on words themselves, but
on query-based statistics of documents such as the total frequency of occurrences or the sum-total of the idf-values of the query terms
Unbalanced data The classes (non-relevant) is a large portion of all the
examples, while the other (relevant) class have only a small percent of of the examples
Over-sampling the minority class Under-sampling the majority class
Experiments and Results (1/8) Ad-hoc retrieval
Data set
Preprocessing K-stemmer and removing stop-words
Only use title queries for retrieval LM
Training the LM consists of learning the optimal value of the smoothing parameter
All LM runs were performed using Lemur
Experiments and Results (2/8) DM
Features
SVM svm-light for SVM runs Linear kernel gives the best performance on
most data set (converge rapidly) ME
The toolkit of Zhang
Experiments and Results (3/8) The comparison of performance of LM, SVM and ME
50% (8/16) is indistinguishable; 12.5% (2/16) is that SVM is statistically better than LM; 37.5 (6/16) is that LM is superior to SVM
Experiments and Results (4/8) Discussion
Official TREC runs – query expansion DM can improve the performance by including other
features such as proximity of query terms, occurrence of query terms as noun-phrases, etc. Such features would not be easy to incorporate into the LM framework
In the emergence of modern IR collections such as web and scientific literature that are characterized by a diverse variety of features, we will increasingly rely on models that can automatically learn these features from examples
Experiments and Results (5/8) Home-page finding on web collection
Choose the home-page finding task of TREC-10 where many features such as title, anchor-text and link structure influence relevance
Example – returned the web-page http://trec.nist.gov where the query “Text Retrieval Conference” is issued
Corpus – WT10G Queries
50 for training, 50 for development and 145 for testing
Evaluation Metrics The mean reciprocal rank (MRR) Success rate – an answer is found in top 10 Failure rate – no answer is returned in top 100
Experiments and Results (6/8) Three Indexes
A content index consisting of the textual content of the documents with all the HTML tags removed
An index of the anchor text documents An index of the titles of all documents
20 Features Use 6 previous features from each of the three
indexes Two additional features
URL-depth – a home-page typically is at depth 1 Link-Factornum-links( )
log(1 )Avg-num-links
D
Experiments and Results (7/8) Performance on development set
Performance on test set
Experiments and Results (8/8) Discussion
SVMs leverage a variety of features and improve on the baseline LM performance by 48.6% in MRR
The best run in TREC-10 achieved an MRR of 0.77 on the test set; however, their feature weights were optimized using empirical means while our models learn them automatically.
Only demonstrate the learning ability of SVMs We believe there is a lot more that needs to be in
defining the right kind of features, such as PageRank for the link-factor feature.
Related Works A few attempts in applying discriminative models for IR Cooper and Huizinga make a strong case for applying
the maximum entropy approach to the problems of information retrieval
Kantor and Lee extend the analysis of the principle of maximum entropy in the context of information retrieval
Greiff and Ponte showed that the classic binary independence model and the maximum entropy approach are equivalent
Gey suggested the method of logistic regression, which is equivalent to the method of maximum entropy used in our work
Conclusion and Future Work (1/2)
Treat IR as a problem of binary classification Quantify relevance explicitly Permit us apply sophisticated pattern classification
technique Explore SVMs and MEs to IR
Their main utility to IR lies in their ability to learn automatically a variety of features that influence relevance
Ad-hoc retrieval SVMs perform as well as LMs
Home-page finding SVMs outperform the baseline runs by about 50% in
MRR
Conclusion and Future Work (2/2)
Future Work Further improvement through better feature
engineering and by leveraging a huge body of literature on SMVs and other learning algorithms
Evaluate the performance of SVMs on ad-hoc retrieval task with longer queries
Enhance features such as proximity of query terms, synonyms, etc.
Study user modeling by incorporating user-preferences as features in the SVMs