Cumulative Progress in Language Models for Information Retrieval
Antti Puurula
6/12/2013
Australasian Language Technology Workshop
University of Waikato
Ad-hoc Information Retrieval
• Ad-hoc Information Retrieval (IR) forms the basic task in IR:
  • Given a query, retrieve and rank documents in a collection
• Origins:
  • Cranfield 1 (1958-1960), Cranfield 2 (1962-1966), SMART (1961-1999)
• Major evaluations:
  • TREC Ad-hoc (1990-1999), TREC Robust (2003-2005), CLEF (2000-2009), INEX (2009-2010), NTCIR (1999-2013), FIRE (2008-2013)
Illusionary Progress in Ad-hoc IR
• TREC ad-hoc evaluations stopped in 1999, as progress plateaued
• More diverse tasks became the foci of research
• “There is little evidence of improvement in ad-hoc retrieval technology over the past decade” (Armstrong et al. 2009)
  • Weak baselines, non-cumulative improvements
  • ⟶ “no way of using LSI achieves a worthwhile improvement in retrieval accuracy over BM25” (Atreya & Elkan, 2010)
  • ⟶ “there remains very little room for improvement in ad hoc search” (Trotman & Keeler, 2011)
Progress in Language Models for IR?
• Language Models (LM) form one of the main approaches to IR
• Many improvements to LMs not adopted generally or evaluated systematically:
  • TF-IDF feature weighting
  • Pitman-Yor Process smoothing
  • Feedback models
• Are these improvements consistent across standard datasets, cumulative, and do they improve on a strong baseline?
Query Likelihood Language Models
• Query Likelihood (QL) (Kalt 1996, Hiemstra 1998, Ponte & Croft 1998) is the basic application of LMs for IR
• Unigram case: using count vectors c^d and c^q to represent documents and queries, rank documents d given a query q according to p(d|q)
• Assuming a generative model p(q|θ_d) and uniform priors over documents d, Bayes' rule gives p(d|q) ∝ p(q|θ_d)
Query Likelihood Language Models 2
• The unigram QL-score for each document becomes:
  p(q|θ_d) = B(c^q) ∏_w p(w|θ_d)^(c_w^q)
• where B(c^q) is the Multinomial coefficient (constant for a fixed query, so it can be dropped in ranking), and document models are given by the Maximum Likelihood estimates: p(w|θ_d) = c_w^d / n^d, with n^d = Σ_w c_w^d
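The QL score can be sketched in a few lines of Python. This is a minimal illustration of unsmoothed ranking, not code from the toolkit used in the experiments; function and variable names are mine. The multinomial coefficient is dropped, since it is constant across documents for a fixed query:

```python
import math
from collections import Counter

def ql_score(query, doc):
    """Unigram query-likelihood log p(q|theta_d), multinomial coefficient
    dropped. Document model is the ML estimate p(w|theta_d) = c_w^d / n^d."""
    counts = Counter(doc)
    n_d = len(doc)
    score = 0.0
    for w, q_w in Counter(query).items():
        p = counts[w] / n_d
        if p == 0.0:
            # unsmoothed ML assigns zero probability to unseen query words,
            # which is exactly why smoothing (next slides) is needed
            return float("-inf")
        score += q_w * math.log(p)
    return score

docs = [["the", "cat", "sat"], ["the", "dog", "ran", "fast"]]
query = ["the", "cat"]
ranking = sorted(range(len(docs)), key=lambda i: ql_score(query, docs[i]),
                 reverse=True)
```

The zero-probability case for unseen query words motivates the smoothing methods on the following slides.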
Pitman-Yor Process Smoothing
• Standard methods for smoothing in IR LMs are Dirichlet Prior (DP) and 2-Stage Smoothing (2SS) (Zhai & Lafferty 2004, Smucker & Allan 2007)
• Recent suggested improvement is Pitman-Yor Process smoothing (PYP), an approximation to inference on a Pitman-Yor Process (Momtazi & Klakow 2010, Huang & Renals 2010)
• All methods interpolate unsmoothed parameters with a background distribution. PYP additionally discounts the unsmoothed counts
Pitman-Yor Process Smoothing 2
• All methods share the form:
  p(w|θ_d) = m_w^d / n^d + α_d p(w|C)
  where m_w^d are the (possibly discounted) document counts and α_d is the interpolation weight given to the background model p(w|C)
• DP: m_w^d = c_w^d n^d / (n^d + μ), and α_d = μ / (n^d + μ)
• 2SS: p(w|θ_d) = (1 − λ)(c_w^d + μ p(w|C)) / (n^d + μ) + λ p(w|C)
• PYP: m_w^d = max(c_w^d − δ, 0), and α_d = δ U^d / n^d, where δ is the discount and U^d the number of unique words in the document
Pitman-Yor Process Smoothing 3
• The background model p(w|C) is most commonly estimated by concatenating all collection documents into a single document:
  p(w|C) = Σ_d c_w^d / Σ_d n^d
• Less commonly, a uniform background model is used: p(w|U) = 1/W, where W is the vocabulary size
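The three smoothing methods and the concatenated-collection background model can be sketched as follows. This is a minimal illustration: the parameter values are arbitrary placeholders, and the PYP formula is the single-discount approximation (count minus a fixed discount δ, with the freed mass δ·U^d redistributed via the background model), which is one common way to approximate PYP inference:

```python
from collections import Counter

def background_model(collection):
    """p(w|C): ML estimate over all collection documents concatenated."""
    counts = Counter(w for doc in collection for w in doc)
    total = sum(counts.values())
    return lambda w: counts[w] / total

def smoothed_model(doc, p_bg, method="pyp", mu=1000.0, lam=0.7, delta=0.5):
    """Return a function w -> p(w|theta_d) under DP, 2SS, or PYP smoothing.
    mu, lam, delta are placeholder values; in the experiments they are
    optimized on development sets."""
    c = Counter(doc)
    n_d = len(doc)
    u_d = len(c)  # number of unique words in the document

    def p(w):
        if method == "dp":    # Dirichlet Prior
            return (c[w] + mu * p_bg(w)) / (n_d + mu)
        if method == "2ss":   # 2-Stage Smoothing: DP interpolated with p(w|C)
            return (1 - lam) * (c[w] + mu * p_bg(w)) / (n_d + mu) + lam * p_bg(w)
        if method == "pyp":   # PYP approximation: discount counts by delta,
            # give the freed mass delta * u_d to the background model
            return (max(c[w] - delta, 0.0) + delta * u_d * p_bg(w)) / n_d
        raise ValueError(method)
    return p
```

Each variant yields a proper distribution over the vocabulary, which is easy to check by summing p(w) over all words in the collection.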
TF-IDF Feature Weighting
• Multinomial modelling assumptions of text can be corrected with TF-IDF weighting (Rennie et al. 2003, Frank & Bouckaert 2006)
• Traditional view: IDF-weighting unnecessary with IR LMs (Zhai & Lafferty 2004)
• Recent view: combination is complementary (Smucker & Allan 2007, Momtazi et al. 2010)
TF-IDF Feature Weighting 2
• Dataset documents can be weighted by TF-IDF:
  ĉ_w^d = TF(c_w^d) · IDF(w)
• where c^d is the unweighted count vector, N the number of documents, and df_w the number of documents where word w occurs
• First factor is a TF log transform using unique length normalization (Singhal et al. 1996)
• Second factor is the Robertson-Walker IDF (Robertson & Zaragoza 2009)
TF-IDF Feature Weighting 3
• IDF has an overlapping function with collection smoothing (Hiemstra & Kraaij 1998)
• Interaction is taken into account by replacing the collection model p(w|C) with the uniform model p(w|U) in smoothing
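A TF-IDF reweighting of count vectors can be sketched as below. Note the TF and IDF variants here (a plain log(1 + c) transform and a log(N/df) IDF) are illustrative stand-ins, not the exact Singhal et al. unique-length-normalized TF and Robertson-Walker IDF named on the slides:

```python
import math
from collections import Counter

def tfidf_weight(docs):
    """Reweight raw count vectors c^d as TF(c_w^d) * IDF(w).
    TF and IDF forms here are simple illustrative choices."""
    N = len(docs)
    df = Counter()                 # df_w: number of documents containing w
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        c = Counter(doc)
        weighted.append({w: math.log(1 + c[w]) * math.log(N / df[w])
                         for w in c})
    return weighted
```

With this IDF choice, a word occurring in every document gets weight zero, which is the usual effect of IDF damping on uninformative terms.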
Model-based Feedback
• Pseudo-feedback is a traditional method in Ad-hoc IR:
  • Using the retrieved documents for the original query q, construct and rank using a new query q̂
• With LMs two different formalizations enable model-based feedback:
  • KL-Divergence Retrieval (Zhai & Lafferty 2001)
  • Relevance Models (Lavrenko & Croft 2001)
• Both enable replacing the original query counts c^q by a model q̂
Model-based Feedback 2
• Many modeling choices exist for the feedback models, such as:
  • Using the top-ranked retrieved documents
  • Truncating the word vector to words present in the original query
  • Weighting the feedback documents using their retrieval scores p(q|θ_d)
  • Interpolating the feedback model with the original query
• These modeling choices are combined here
Model-based Feedback 3
• The interpolated query model q̂ is estimated for the query words w from the top-n document models θ_d:
  p(w|q̂) = (1 − β) c_w^q / n^q + β Z⁻¹ Σ_d p(q|θ_d) p(w|θ_d)
• where β is the interpolation weight and Z is a normalizer: Z = Σ_d p(q|θ_d)
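The modeling choices above (truncation to query words, score-weighted feedback documents, interpolation with the original query) can be combined in a short sketch. The score-normalization scheme here is my assumption of one reasonable formulation, not necessarily the exact one used in the paper:

```python
def feedback_query(query_counts, top_docs, beta=0.5):
    """Interpolate the original query distribution with a model-based
    feedback distribution built from top-ranked documents.
    query_counts: dict w -> c_w^q for the original query.
    top_docs: list of (score, model) pairs, where score = p(q|theta_d)
    and model maps words to p(w|theta_d).
    Only words already in the query are kept (truncation)."""
    n_q = sum(query_counts.values())
    Z = sum(score for score, _ in top_docs)   # normalizer over doc scores
    q_hat = {}
    for w, c_w in query_counts.items():
        fb = sum(score * model.get(w, 0.0) for score, model in top_docs) / Z
        q_hat[w] = (1 - beta) * c_w / n_q + beta * fb
    return q_hat
```

Setting beta = 0 recovers the original query, so the feedback model can only be mixed in as far as it helps on development data.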
Experimental Setup
• Ad-hoc IR experiments conducted on 13 standard datasets:
  • TREC 1-5, split according to data source
  • OHSU-TREC
  • FIRE 2008-2011 English
• Preprocessing: stopword & short word removal, Porter stemming
• Each dataset split into development and evaluation subsets
Experimental Setup 2
• Software used for experiments was the SGMWeka 1.44 toolkit:• http://sourceforge.net/projects/sgmweka/
• Smoothing parameters optimized on development sets using Gaussian Random Searches (Luke 2009)
• Evaluation performed on evaluation sets, using Mean Average Precision over the top 50 documents (MAP@50)
• Significance tested with paired one-tailed t-tests between the datasets
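The evaluation measure can be sketched as follows. This uses one common convention for MAP@k (average precision over the top k, normalized by the number of relevant documents capped at k); conventions for the denominator vary between tools:

```python
def average_precision_at_k(ranked, relevant, k=50):
    """AP@k for one query: mean precision at each relevant hit in the
    top k, normalized by the number of relevant documents (capped at k)."""
    hits, ap = 0, 0.0
    for i, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            ap += hits / i
    return ap / min(len(relevant), k) if relevant else 0.0

def map_at_k(runs, qrels, k=50):
    """MAP@k: mean of AP@k over all queries.
    runs: query -> ranked doc ids; qrels: query -> set of relevant ids."""
    return sum(average_precision_at_k(runs[q], qrels[q], k)
               for q in runs) / len(runs)
```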
Results
• Significant differences:
  • PYP > DP
  • PYP+TI > 2SS
  • PYP+TI+FB > PYP+TI
• PYP+TI+FB improves on 2SS by 4.07 MAP@50 absolute, a 17.1% relative improvement
Discussion
• The 3 evaluated improvements in language models for IR:
  • require little additional computation
  • can be implemented with small modifications to existing IR systems
  • are substantial, significant and cumulative across 13 standard datasets, compared to DP and 2SS baselines (4.07 MAP@50 absolute, 17.1% relative)
• Improvements requiring more computation are possible:
  • document neighbourhood smoothing, word correlation models, passage-based LMs, bigram LMs, …
• More extensive evaluations needed for confirming progress