Two-stage Language Models for
Information Retrieval
ChengXiang Zhai*, John Lafferty
School of Computer Science
Carnegie Mellon University
*New Address
Department of Computer Science
University of Illinois, Urbana-Champaign
Motivation
• Retrieval parameters are needed to
– model different user preferences
– customize a retrieval model according to different queries and documents
• So far, parameters have been set through empirical experimentation
• Can we set parameters automatically?
Parameters in Traditional Models
• EXTERNAL to the model, hard to interpret
– Most parameters are introduced heuristically to implement our “intuition”
– As a result, no principles to quantify them
• Set through empirical experiments
– Lots of experimentation
– Optimality for new queries is not guaranteed
Example of Parameter Tuning (Okapi)
(Robertson et al. 1999)
“k1, b and k3 are parameters which depend on the nature of the queries and possibly on the database; k1 and b default to 1.2 and 0.75 respectively, but smaller values of b are sometimes advantageous; in long queries k3 is often set to 7 or 1000 (effectively infinite).”
The Way to Automatic Tuning ...
• Parameters must be PART of the model!
– Query modeling (explain difference in query)
– Document modeling (explain difference in doc)
• Decouple the influence of the query on parameter setting from that of the documents
– To achieve stable setting of parameters
– To pre-compute query-independent parameters
The Rest of the Talk
• Risk minimization retrieval framework
• Two-stage language models
• Two-stage Dirichlet-Mixture smoothing
• Parameter estimation
The Risk Minimization Framework (Lafferty & Zhai 01, Zhai 02)
• Documents → Document Language Models (DOC MODELING)
• Query → Query Language Model (QUERY MODELING)
• User → Loss Function (USER MODELING)
• Retrieval Decision
Parameter Setting in Risk Minimization
• Query → Query Language Model: query model parameters (estimated)
• Documents → Document Language Models: doc model parameters (estimated)
• User → Loss Function: user model parameters (set manually)
Two-stage Language Models
• Query q → Query Language Model θ_Q: p(q | θ_Q, U)
• Doc d → Document Language Model θ_D: p(θ_D | d, S)
• Loss Function:
  l(θ_Q, θ_D) = 0 if Δ(θ_Q, θ_D) ≤ ε, c otherwise
• Risk ranking formula:
  R(d, q) ∝ p(q | θ̂_D, U)
  – Stage 1: estimate θ̂_D from the document d (and collection S)
  – Stage 2: compute p(q | θ̂_D, U)
  Smoothing!
Sensitivity in Traditional (“one-stage”) Smoothing
[Plot: retrieval precision vs. smoothing parameter, for keyword queries and verbose (sentence-like) queries]
The Need of Two-stage Smoothing (I)
Accurate Estimation of Doc Model
Document: a 500-word text mining paper
Language Model P(w|d) (maximum likelihood):
  text        10/500 = 0.02
  mining       3/500 = 0.006
  association  1/500 = 0.002
  algorithm    2/500 = 0.004
  data         0/500 = 0
Query = “data mining algorithms”
p(q|d) = p(“data”|d) p(“mining”|d) p(“algorithms”|d) = 0 × 0.006 × 0.004 = 0!
Is that reasonable? What should P(“data”|d) be? And P(“unicorn”|d)?
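The zero-probability problem can be reproduced in a few lines, using the hypothetical counts from the slide above:

```python
from math import prod

# Hypothetical counts for the 500-word "text mining paper" above.
counts = {"text": 10, "mining": 3, "association": 1, "algorithm": 2}
doc_len = 500

def p_ml(w):
    """Unsmoothed maximum-likelihood estimate c(w,d)/|d|."""
    return counts.get(w, 0) / doc_len

query = ["data", "mining", "algorithm"]
p_q = prod(p_ml(w) for w in query)  # one unseen word zeroes the product
```

A single query word absent from the document drives the whole query likelihood to zero, no matter how well the other words match.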
The Need of Two-stage Smoothing (II)
Explanation of Noise in Query
Query = “the algorithms for data mining”

        the    algorithms  for    data   mining
  d1:   0.04   0.001       0.02   0.002  0.003
  d2:   0.02   0.001       0.01   0.003  0.004

p(“algorithms”|d1) = p(“algorithms”|d2)
p(“data”|d1) < p(“data”|d2)
p(“mining”|d1) < p(“mining”|d2)
But p(q|d1) > p(q|d2)!
We should make p(“the”) and p(“for”) less different across documents.
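A quick check of the arithmetic, with the per-word probabilities taken from the table above:

```python
from math import prod

# Per-word probabilities for query "the algorithms for data mining"
d1 = [0.04, 0.001, 0.02, 0.002, 0.003]
d2 = [0.02, 0.001, 0.01, 0.003, 0.004]

p_q_d1 = prod(d1)  # dominated by the common words "the" and "for"
p_q_d2 = prod(d2)
# d1 scores higher overall even though d2 matches the content words better
```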
Two-stage Dirichlet-Mixture Smoothing

P(w|d) = (1 − λ) · (c(w,d) + μ·p(w|C)) / (|d| + μ) + λ·p(w|U)

Stage-1 Smoothing (μ):
  – Explains unseen words
  – Dirichlet prior: adds μ pseudo counts from the collection model p(w|C)
Stage-2 Smoothing (λ):
  – Explains noise in the query
  – 2-component mixture: linear interpolation with the user background model p(w|U)
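As a sketch, the combined estimate can be written directly from the formula; the defaults μ = 2000 and λ = 0.1 below are illustrative values, not the estimated ones:

```python
def p_two_stage(w, doc_counts, doc_len, p_C, p_U, mu=2000.0, lam=0.1):
    """Two-stage Dirichlet-mixture smoothed P(w|d).

    Stage 1: Dirichlet prior -- add mu pseudo counts from the
             collection model p_C to explain unseen words.
    Stage 2: linear interpolation with the user background model
             p_U (weight lam) to explain noise in the query.
    """
    p_dir = (doc_counts.get(w, 0) + mu * p_C.get(w, 0.0)) / (doc_len + mu)
    return (1 - lam) * p_dir + lam * p_U.get(w, 0.0)
```

Ranking a document then just multiplies (or log-sums) this quantity over the query words; no word ever gets probability zero as long as p(w|C) covers the vocabulary.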
Estimating μ using Leave-one-out

Leave-one-out: hold out each word occurrence w_i and predict it from the rest of the document, P(w_i | d − w_i).

Log-likelihood:
  L_{-1}(μ | C) = Σ_{i=1}^{N} Σ_{w∈V} c(w, d_i) · log( (c(w, d_i) − 1 + μ·p(w|C)) / (|d_i| − 1 + μ) )

Maximum Likelihood Estimator:
  μ̂ = argmax_μ L_{-1}(μ | C)

Solved with Newton’s Method.
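A runnable sketch of the stage-1 estimation; the slide solves the argmax with Newton’s method, but for illustration a simple grid search over μ performs the same maximization (the collection below is made up):

```python
from math import log

def loo_loglik(docs, p_C, mu):
    """Leave-one-out log-likelihood L_{-1}(mu|C):
    sum over docs d and words w of
    c(w,d) * log((c(w,d) - 1 + mu*p(w|C)) / (|d| - 1 + mu))."""
    total = 0.0
    for counts in docs:
        dlen = sum(counts.values())
        for w, c in counts.items():
            total += c * log((c - 1 + mu * p_C[w]) / (dlen - 1 + mu))
    return total

def estimate_mu(docs, p_C, grid=None):
    """mu-hat = argmax_mu L_{-1}(mu|C); a coarse grid search stands in
    for the Newton iteration used in the talk."""
    grid = grid or [0.1 * 1.3 ** k for k in range(40)]
    return max(grid, key=lambda mu: loo_loglik(docs, p_C, mu))
```

Note that μ̂ depends only on the collection, so it can be computed offline, before any query arrives.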
Estimating λ using a Mixture Model

Treat the query q = q_1 … q_m as a sample from a mixture over the N documents:

  p(q | λ, U) = Π_{j=1}^{m} Σ_{i=1}^{N} α_i · ( (1 − λ)·p(q_j | θ̂_{d_i}) + λ·p(q_j | U) )

Maximum Likelihood Estimator:
  λ̂ = argmax_λ p(q | λ, U)

Solved with the Expectation-Maximization (EM) algorithm: simultaneously adjust λ and α_1, …, α_N to maximize the query likelihood.
Stage-1: Dirichlet-smoothed document models p(w|d_1), …, p(w|d_N)
Stage-2: mixture components (1 − λ)·p(w|d_i) + λ·p(w|U), i = 1, …, N
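A minimal EM sketch for the stage-2 estimation, assuming the stage-1 smoothed document models are given as word-probability dictionaries (the function name and data layout are illustrative):

```python
def em_lambda(query, doc_models, p_U, iters=50):
    """EM for lambda-hat = argmax_lambda p(q|lambda,U), where
    p(q|lambda,U) = prod_j sum_i alpha_i*((1-lam)*p(q_j|d_i) + lam*p(q_j|U)).
    Simultaneously adjusts lam and the document weights alpha_1..alpha_N."""
    N, m = len(doc_models), len(query)
    lam, alpha = 0.5, [1.0 / N] * N
    for _ in range(iters):
        lam_acc, alpha_acc = 0.0, [0.0] * N
        for w in query:
            # E-step: joint posterior over (document i, doc-vs-background)
            joint = [(alpha[i] * (1 - lam) * doc_models[i].get(w, 0.0),
                      alpha[i] * lam * p_U.get(w, 0.0)) for i in range(N)]
            z = sum(a + b for a, b in joint)
            for i, (a, b) in enumerate(joint):
                alpha_acc[i] += (a + b) / z
                lam_acc += b / z
        # M-step: lam = expected fraction of query words from the background
        lam = lam_acc / m
        alpha = [x / m for x in alpha_acc]
    return lam, alpha
```

Unlike μ̂, the noise weight λ̂ is query-dependent, so this estimation runs online, once per query.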
Effectiveness of Parameter Estimation
• Five databases
– News articles (AP, WSJ, ZIFF, FBIS, FT, LA)
– Government documents (Federal Register)
– Web pages
• Four types of queries
– Long vs. short
– Verbose (sentence-like) vs. keyword
• Results: Automatic 2-stage ≈ Optimal 1-stage
Collection  Query  Optimal-JM  Optimal-Dir  Auto-2stage
AP88-89     SK     20.3%       23.0%        22.2%*
AP88-89     LK     36.8%       37.6%        37.4%
AP88-89     SV     18.8%       20.9%        20.4%
AP88-89     LV     28.8%       29.8%        29.2%
WSJ87-92    SK     19.4%       22.3%        21.8%*
WSJ87-92    LK     34.8%       35.3%        35.8%
WSJ87-92    SV     17.2%       19.6%        19.9%
WSJ87-92    LV     27.7%       28.2%        28.8%*
ZIFF1-2     SK     17.9%       21.5%        20.0%
ZIFF1-2     LK     32.6%       32.6%        32.2%
ZIFF1-2     SV     15.6%       18.5%        18.1%
ZIFF1-2     LV     26.7%       27.9%        27.9%*
Automatic 2-stage results ≈ Optimal 1-stage results
Average precision (3 DB’s + 4 query types, 150 topics)
Automatic 2-stage results ≈ Optimal 1-stage results
Average precision ( 2 large DB’s + 2 query types, 50 topics)
Collection   Query          Optimal-JM  Optimal-Dir  Auto-2stage
Disk4&5-CR   351-400 title  0.167       0.186        0.182
Disk4&5-CR   351-400 long   0.222       0.224        0.230
Disk4&5-CR   401-450 title  0.239       0.256        0.257
Disk4&5-CR   401-450 long   0.265       0.260        0.268
Web          401-450 title  0.243       0.294        0.278*
Web          401-450 long   0.259       0.275        0.284
Conclusions
• Two-stage language models
– Direct modeling of both queries and documents
– Parameters are part of a probabilistic model
– Parameters can be estimated using standard estimation techniques
• Two-stage Dirichlet-Mixture smoothing
– Involves two meaningful parameters (i.e., document sample size μ and query noise λ)
– Achieves very good performance through automatically setting smoothing parameters
• It is possible to set parameters automatically!
Future Work
• Optimality analysis in the two-stage parameter space
• Offline vs. online estimation
• Alternative estimation methods
• Parameter estimation for more sophisticated language models (e.g., with feedback)
Thank you!