Language Models Hongning Wang CS@UVa


Page 1: Language Models

Language Models
Hongning Wang

CS@UVa

Page 2: Language Models

CS 6501: Information Retrieval 2

Two-stage smoothing [Zhai & Lafferty 02]

$$P(w \mid d) = (1-\lambda)\,\frac{c(w,d) + \mu\, p(w \mid C)}{|d| + \mu} + \lambda\, p(w \mid U)$$

Stage-1: Dirichlet prior (Bayesian) smoothing with the collection LM $p(w \mid C)$; this explains unseen words.

Stage-2: 2-component mixture with the user background model $p(w \mid U)$; this explains noise in the query. $p(w \mid U)$ can be approximated by $p(w \mid C)$.

CS@UVa
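Below is a minimal Python sketch of this two-stage estimate; the function and argument names, the dictionary-based LMs, and the default values of mu and lam are illustrative assumptions rather than anything from the slides. As noted above, p(w|U) can be approximated by p(w|C), i.e., the same dictionary may be passed for both.

```python
from collections import Counter

def two_stage_prob(w, doc_tokens, collection_lm, background_lm, mu=2000.0, lam=0.1):
    """P(w|d) = (1-lam) * (c(w,d) + mu*p(w|C)) / (|d| + mu) + lam * p(w|U)."""
    counts = Counter(doc_tokens)
    # Stage-1: Dirichlet-prior (Bayesian) smoothing with the collection LM p(w|C)
    dirichlet = (counts[w] + mu * collection_lm.get(w, 0.0)) / (len(doc_tokens) + mu)
    # Stage-2: 2-component mixture with the user background model p(w|U)
    return (1.0 - lam) * dirichlet + lam * background_lm.get(w, 0.0)
```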

Page 3: Language Models

CS 6501: Information Retrieval 3

Estimating λ using the EM algorithm [Zhai & Lafferty 02]

Query Q = q_1 ... q_m. Consider query generation as a mixture over all the relevant documents, where $\pi_j$ is the probability of choosing the language model of document $d_j$:

$$p(Q \mid \lambda, U) = \prod_{i=1}^{m} \sum_{j=1}^{N} \pi_j \big[(1-\lambda)\, p(q_i \mid d_j) + \lambda\, p(q_i \mid U)\big], \qquad \hat{\lambda} = \arg\max_{\lambda}\, p(Q \mid \lambda, U)$$

Stage-1: each document LM is Dirichlet-smoothed with the $\hat{\mu}$ estimated in stage-1,

$$p(q_i \mid d_j) = \frac{c(q_i, d_j) + \hat{\mu}\, p(q_i \mid C)}{|d_j| + \hat{\mu}}$$

Stage-2: each smoothed document LM $p(w \mid d_j)$ is mixed with the user background model as $(1-\lambda)\, p(w \mid d_j) + \lambda\, p(w \mid U)$, for $j = 1, \ldots, N$, and the Expectation-Maximization (EM) algorithm is used for estimating $\pi$ and $\lambda$.

CS@UVa
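A simplified Python sketch of the EM update for λ, holding the document weights π_j fixed (uniform by default) and the stage-1 smoothed document LMs fixed; the function name, the dictionary-based LMs, and the floor value 1e-12 are illustrative assumptions.

```python
def em_lambda(query, doc_lms, background_lm, priors=None, lam=0.5, n_iters=50):
    """Estimate the stage-2 mixing weight lambda by EM.

    query: list of query terms q_1..q_m
    doc_lms: list of dicts, doc_lms[j][w] = stage-1 smoothed p(w|d_j)
    background_lm: dict, background_lm[w] = p(w|U)
    """
    N = len(doc_lms)
    priors = priors or [1.0 / N] * N
    for _ in range(n_iters):
        # E-step: posterior probability that each query term was drawn from p(.|U)
        resp = []
        for q in query:
            doc_part = (1.0 - lam) * sum(pi * lm.get(q, 1e-12)
                                         for pi, lm in zip(priors, doc_lms))
            bg_part = lam * background_lm.get(q, 1e-12)
            resp.append(bg_part / (doc_part + bg_part))
        # M-step: lambda becomes the average background responsibility
        lam = sum(resp) / len(query)
    return lam
```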

Page 4: Language Models

CS 6501: Information Retrieval 4

Introduction to EM

• Parameter estimation
  • All data is observable
    • Maximum likelihood method
    • Optimize the analytic form of the likelihood $L(\theta) = \log p(X \mid \theta)$
  • Missing/unobservable data
    • Data: X (observed) + H (hidden)
    • Likelihood: $L(\theta) = \log \int p(X, H \mid \theta)\, dH$, which is intractable in most cases
    • Approximate it!

Example of hidden data: the sentence "We have missing date." (is the intended word "date" or "data"?)

CS@UVa

Page 5: Language Models

CS 6501: Information Retrieval 5

Background knowledge

• Jensen's inequality
  • For any convex function $f$ and positive weights $\lambda_i$ with $\sum_i \lambda_i = 1$:

$$f\Big(\sum_i \lambda_i x_i\Big) \le \sum_i \lambda_i f(x_i)$$

CS@UVa
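Since log is concave rather than convex, the direction of the inequality reverses, giving $f(E[x]) \ge E[f(x)]$; this is the form used to derive the EM lower bound on the next slide. A brief note of that step, written with the proposal distribution $q(H)$ introduced there:

```latex
% Jensen's inequality for the concave \log, with weights q(H):
\[
\log p(X \mid \theta)
  = \log \int q(H)\,\frac{p(X,H \mid \theta)}{q(H)}\,dH
  \;\ge\; \int q(H)\,\log \frac{p(X,H \mid \theta)}{q(H)}\,dH .
\]
```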

Page 6: Language Models

CS 6501: Information Retrieval 6

Expectation Maximization

• Maximize the data likelihood function by pushing up the lower bound

$$L(\theta) = \log p(X \mid \theta) = \log \int q(H)\,\frac{p(X, H \mid \theta)}{q(H)}\, dH \;\ge\; \int q(H) \log p(X, H \mid \theta)\, dH - \int q(H) \log q(H)\, dH$$

(by Jensen's inequality, $f(E[x]) \ge E[f(x)]$ for the concave $f = \log$)

Lower bound: easier to compute, many good properties!

Components we need to tune when optimizing $L(\theta)$: $q(H)$ and $\theta$, where $q(H)$ is a proposal distribution for the hidden variable $H$.

CS@UVa
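A tiny numeric sanity check of the lower bound for a single observation from a two-component mixture; the probabilities below are made-up illustrative numbers, not from the slides. The bound never exceeds $\log p(x \mid \theta)$, and it is tight when $q$ equals the posterior $p(H \mid x, \theta)$:

```python
import math

p_h = [0.3, 0.7]                    # p(H|theta): mixture weights over the hidden component
p_x_given_h = [0.05, 0.20]          # p(x|H, theta) for one observed x
p_x = sum(ph * px for ph, px in zip(p_h, p_x_given_h))        # p(x|theta)

posterior = [ph * px / p_x for ph, px in zip(p_h, p_x_given_h)]
for q in ([0.5, 0.5], [0.1, 0.9], posterior):
    # lower bound = sum_H q(H) log p(x,H|theta) - sum_H q(H) log q(H)
    bound = sum(qi * math.log(ph * px) for qi, ph, px in zip(q, p_h, p_x_given_h)) \
            - sum(qi * math.log(qi) for qi in q)
    print(round(bound, 4), "<=", round(math.log(p_x), 4))
```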

Page 7: Language Models

CS 6501: Information Retrieval 7

Intuitive understanding of EM

[Figure: the data likelihood $p(X \mid \theta)$ and its lower bound; the lower bound is easier to optimize and is guaranteed to improve the data likelihood.]

CS@UVa

Page 8: Language Models

CS 6501: Information Retrieval 8

Expectation Maximization (cont)
• Optimize the lower bound with respect to $q(H)$

$$L(\theta) \ge \int q(H) \log p(X, H \mid \theta)\, dH - \int q(H) \log q(H)\, dH$$
$$= \int q(H) \big[\log p(H \mid X, \theta) + \log p(X \mid \theta)\big]\, dH - \int q(H) \log q(H)\, dH$$
$$= \int q(H) \log \frac{p(H \mid X, \theta)}{q(H)}\, dH + \log p(X \mid \theta)$$

The first term is the negative KL-divergence between $q(H)$ and $p(H \mid X, \theta)$; the second term is constant with respect to $q(H)$.

CS@UVa

Page 9: Language Models

CS 6501: Information Retrieval 9

Expectation Maximization (cont)
• Optimize the lower bound with respect to $q(H)$

$$L(\theta) \ge -KL\big(q(H)\,\|\,p(H \mid X, \theta)\big) + L(\theta)$$

• The KL-divergence is non-negative, and equals zero iff $q(H) = p(H \mid X, \theta)$
• A step further: when $q(H) = p(H \mid X, \theta)$, we get equality, i.e., the lower bound is tight!
• Other choices of $q(H)$ cannot lead to this tight bound, but might reduce computational complexity
• Note: the calculation of $p(H \mid X, \theta)$ is based on the current $\theta$

CS@UVa

Page 10: Language Models

CS 6501: Information Retrieval 10

Expectation Maximization (cont)
• Optimize the lower bound with respect to $q(H)$
• Optimal solution: $q(H) = p(H \mid X, \theta^t)$, the posterior distribution of $H$ given the current model $\theta^t$

CS@UVa

Page 11: Language Models

CS 6501: Information Retrieval 11

Expectation Maximization (cont)
• Optimize the lower bound with respect to $\theta$

$$\theta^{t+1} = \arg\max_\theta \int p(H \mid X, \theta^t) \log p(X, H \mid \theta)\, dH - \int p(H \mid X, \theta^t) \log p(H \mid X, \theta^t)\, dH$$

The second term is constant w.r.t. $\theta$, so

$$\theta^{t+1} = \arg\max_\theta\; E_{H \mid X, \theta^t}\big[\log p(X, H \mid \theta)\big]$$

the expectation of the complete data likelihood.

CS@UVa

Page 12: Language Models

CS 6501: Information Retrieval 12

Expectation Maximization

• EM tries to iteratively maximize the likelihood
• "Complete" likelihood: $L_c(\theta) = \log p(X, H \mid \theta)$
• Starting from an initial guess $\theta^{(0)}$,
  1. E-step: compute the expectation of the complete likelihood
$$Q(\theta; \theta^t) = E_{H \mid X, \theta^t}\big[L_c(\theta)\big] = \int p(H \mid X, \theta^t)\, \log p(X, H \mid \theta)\, dH$$
  2. M-step: compute $\theta^{t+1}$ by maximizing the Q-function (the key step!)
$$\theta^{t+1} = \arg\max_\theta\, Q(\theta; \theta^t)$$

CS@UVa
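To make the E-step/M-step recipe concrete, here is a minimal Python sketch of EM on a simple two-component word mixture: each observed word token is generated either by a known background LM (with fixed weight lam_b) or by an unknown topic LM theta that we want to estimate. The setup, names, and default values are illustrative assumptions, not a model from the slides.

```python
from collections import Counter

def em_topic_lm(tokens, background_lm, lam_b=0.5, n_iters=50):
    counts = Counter(tokens)
    vocab = list(counts)
    theta = {w: 1.0 / len(vocab) for w in vocab}       # initial guess theta^(0)
    for _ in range(n_iters):
        # E-step: posterior that each word token came from the topic LM (the hidden H)
        resp = {}
        for w in vocab:
            topic = (1.0 - lam_b) * theta[w]
            bg = lam_b * background_lm.get(w, 1e-12)
            resp[w] = topic / (topic + bg)
        # M-step: maximize the expected complete-data likelihood (re-normalized counts)
        weighted = {w: counts[w] * resp[w] for w in vocab}
        total = sum(weighted.values())
        theta = {w: weighted[w] / total for w in vocab}
    return theta
```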

Page 13: Language Models

13

Intuitive understanding of EM

[Figure: the data likelihood $p(X \mid \theta)$ and its lower bound (the Q function); maximizing the lower bound built at the current guess $\theta^t$ yields the next guess $\theta^{t+1}$.]

E-step = computing the lower bound; M-step = maximizing the lower bound.

Example: S = "We have missing date."
E-step: $p(date \mid S, \theta^t) = 0.1$, $p(data \mid S, \theta^t) = 0.9$
M-step: $p(data \mid \theta^{t+1}) \propto 0.9 + \sum_{observed} 1$

Page 14: Language Models

CS 6501: Information Retrieval 14

Convergence guarantee

• Proof of EM

$$\log p(X \mid \theta) = \log p(H, X \mid \theta) - \log p(H \mid X, \theta)$$

Taking the expectation of both sides with respect to $p(H \mid X, \theta^t)$:

$$\log p(X \mid \theta) = \int p(H \mid X, \theta^t) \log p(H, X \mid \theta)\, dH - \int p(H \mid X, \theta^t) \log p(H \mid X, \theta)\, dH = Q(\theta; \theta^t) + H(\theta; \theta^t)$$

where the second term $H(\theta; \theta^t)$ is a cross-entropy. The change of the log data likelihood between EM iterations is then

$$\log p(X \mid \theta) - \log p(X \mid \theta^t) = Q(\theta; \theta^t) + H(\theta; \theta^t) - Q(\theta^t; \theta^t) - H(\theta^t; \theta^t)$$

By Jensen's inequality, $H(\theta; \theta^t) \ge H(\theta^t; \theta^t)$, which means

$$\log p(X \mid \theta) - \log p(X \mid \theta^t) \ge Q(\theta; \theta^t) - Q(\theta^t; \theta^t) \ge 0$$

The M-step guarantees the last inequality.

CS@UVa

Page 15: Language Models

CS 6501: Information Retrieval 15

What is not guaranteed

• A global optimum is not guaranteed!
  • The likelihood $L(\theta)$ is non-convex in most cases
  • EM boils down to a greedy algorithm
    • Alternating ascent
• Generalized EM
  • E-step: compute the expectation of the complete likelihood
  • M-step: choose a $\theta$ that improves $Q(\theta; \theta^t)$, not necessarily maximizing it

CS@UVa

Page 16: Language Models

CS 6501: Information Retrieval 16

Estimating λ using Mixture Model [Zhai & Lafferty 02]

Query Q = q_1 ... q_m. The origin of each query term is unknown, i.e., does it come from the user background model or from the topic (document) model?

$$p(Q \mid \lambda, U) = \prod_{i=1}^{m} \sum_{j=1}^{N} \pi_j \big[(1-\lambda)\, p(q_i \mid d_j) + \lambda\, p(q_i \mid U)\big], \qquad \hat{\lambda} = \arg\max_{\lambda}\, p(Q \mid \lambda, U)$$

As before, stage-1 supplies each Dirichlet-smoothed document LM, with $\hat{\mu}$ estimated in stage-1,

$$p(q_i \mid d_j) = \frac{c(q_i, d_j) + \hat{\mu}\, p(q_i \mid C)}{|d_j| + \hat{\mu}}$$

and stage-2 mixes each document LM with the user background model, $(1-\lambda)\, p(w \mid d_j) + \lambda\, p(w \mid U)$; the Expectation-Maximization (EM) algorithm is used for estimating $\pi$ and $\lambda$.

CS@UVa

Page 17: Language Models

CS 6501: Information Retrieval 17

Variants of basic LM approach
• Different smoothing strategies
  • Hidden Markov Models (essentially linear interpolation) [Miller et al. 99]
  • Smoothing with an IDF-like reference model [Hiemstra & Kraaij 99]
  • Performance tends to be similar to the basic LM approach
  • Many other possibilities for smoothing [Chen & Goodman 98]
• Different priors
  • Link information as prior leads to significant improvement of Web entry page retrieval performance [Kraaij et al. 02]
  • Time as prior [Li & Croft 03]
  • PageRank as prior [Kurland & Lee 05]
• Passage retrieval [Liu & Croft 02]

CS@UVa

Page 18: Language Models

CS 6501: Information Retrieval 18

Improving language models

• Capturing limited dependencies
  • Bigrams/Trigrams [Song & Croft 99]
  • Grammatical dependency [Nallapati & Allan 02, Srikanth & Srihari 03, Gao et al. 04]
  • Generally insignificant improvement as compared with other extensions such as feedback
• Full Bayesian query likelihood [Zaragoza et al. 03]
  • Performance similar to the basic LM approach
• Translation model for p(Q|D,R) [Berger & Lafferty 99, Jin et al. 02, Cao et al. 05]
  • Addresses polysemy and synonyms
  • Improves over the basic LM methods, but computationally expensive
• Cluster-based smoothing/scoring [Liu & Croft 04, Kurland & Lee 04, Tao et al. 06]
  • Improves over the basic LM, but computationally expensive
• Parsimonious LMs [Hiemstra et al. 04]
  • Uses a mixture model to "factor out" non-discriminative words

CS@UVa

Page 19: Language Models

CS 6501: Information Retrieval 19

A unified framework for IR: Risk Minimization
• Long-standing IR challenges
  • Improve IR theory
    • Develop theoretically sound and empirically effective models
    • Go beyond the limited traditional notion of relevance (independent, topical relevance)
  • Improve IR practice
    • Optimize retrieval parameters automatically
• Language models are promising tools…
  • How can we systematically exploit LMs in IR?
  • Can LMs offer anything hard/impossible to achieve in traditional IR?

CS@UVa

Page 20: Language Models

CS 6501: Information Retrieval 20

Idea 1: Retrieval as decision-making (a more general notion of relevance)

Given a query,
- Which documents should be selected? (D)
- How should these docs be presented to the user? (π)
Choose: (D, π)

Possible presentations: a ranked list? an unordered subset? clustering?

CS@UVa

Page 21: Language Models

CS 6501: Information Retrieval 21

Idea 2: Systematic language modeling

[Figure: Documents are mapped to document language models (DOC MODELING); the query is mapped to a query language model (QUERY MODELING); the user determines the loss function (USER MODELING); together these feed the retrieval decision: ?]

CS@UVa

Page 22: Language Models

CS 6501: Information Retrieval 22

Generative model of document & query [Lafferty & Zhai 01b]

[Figure: the user U generates a query model $\theta_Q \sim p(\theta_Q \mid U)$, which generates the query $q \sim p(q \mid \theta_Q, U)$; the source S generates a document model $\theta_D \sim p(\theta_D \mid S)$, which generates the document $d \sim p(d \mid \theta_D, S)$; relevance $R \sim p(R \mid \theta_Q, \theta_D)$. The query and document are observed, U and S are partially observed, and $\theta_Q$, $\theta_D$, and R are inferred.]

CS@UVa

Page 23: Language Models

CS 6501: Information Retrieval 23

Applying Bayesian Decision Theory [Lafferty & Zhai 01b, Zhai 02, Zhai & Lafferty 06]

[Figure: given the observed query q (from user U) and the document collection C (from source S), the system chooses among (D_1, π_1), (D_2, π_2), ..., (D_n, π_n), each choice with an associated loss L.]

$$(D^*, \pi^*) = \arg\min_{D, \pi} \int_{\Theta} L(D, \pi, \theta)\, p(\theta \mid q, U, C, S)\, d\theta$$

The loss depends on the hidden state $\theta$ and the observed q, U, C, S; the integral is the Bayes risk for choice (D, π): RISK MINIMIZATION.

CS@UVa

Page 24: Language Models

CS 6501: Information Retrieval 24

Special cases

• Set-based models (choose D)
• Ranking models (choose π)
  • Independent loss
    • Relevance-based loss
    • Distance-based loss
  • Dependent loss
    • Maximum Margin Relevance loss
    • Maximal Diverse Relevance loss

Special cases covered include the Boolean model, the vector-space model, the two-stage LM, the KL-divergence model, the probabilistic relevance model, Generative Relevance Theory, and the subtopic retrieval model.

CS@UVa

Page 25: Language Models

CS 6501: Information Retrieval 25

Optimal ranking for independent loss

Decision space = {rankings}; sequential browsing; independent loss.

With an independent loss, independent risk = independent scoring (the "risk ranking principle" [Zhai 02]): the optimal ranking presents documents in ascending order of the expected loss

$$r(d_k \mid q, U, C, S) = \int l(\theta_k)\, p(\theta_k \mid q, U, C, S, d_k)\, d\theta_k$$

where the weight $s_i$ in the derivation is the probability that the user would stop reading after seeing the top $i$ documents.

CS@UVa
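A minimal Python sketch of scoring under an independent, relevance-based loss: each document is scored by its expected loss under a posterior probability of relevance, and documents are ranked in ascending order of that risk. The function p_rel, the binary relevance model, and the loss values are illustrative assumptions, not part of the framework's specification.

```python
def rank_by_risk(query, docs, p_rel, loss_rel=0.0, loss_nonrel=1.0):
    """Rank documents by expected loss under an independent loss (a sketch).

    p_rel(query, d) is assumed to return P(R=1 | query, d), e.g. derived from a
    (two-stage smoothed) query-likelihood score.
    """
    def risk(d):
        p = p_rel(query, d)
        return p * loss_rel + (1.0 - p) * loss_nonrel   # expected loss of presenting d
    return sorted(docs, key=risk)                       # ascending expected loss
```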

Page 26: Language Models

CS 6501: Information Retrieval 26

Automatic parameter tuning
• Retrieval parameters are needed to
  • model different user preferences
  • customize a retrieval model to specific queries and documents
• Retrieval parameters in traditional models
  • EXTERNAL to the model, hard to interpret
  • Parameters are introduced heuristically to implement "intuition"
  • No principles to quantify them; they must be set empirically through many experiments
  • Still no guarantee for new queries/documents
• Language models make it possible to estimate parameters…

CS@UVa

Page 27: Language Models

CS 6501: Information Retrieval 27

Parameter setting in risk minimization

[Figure: the query is mapped to a query language model, whose query model parameters are estimated; the documents are mapped to document language models, whose doc model parameters are estimated; the user determines the loss function, whose user model parameters are set.]

CS@UVa

Page 28: Language Models

CS 6501: Information Retrieval 28

Summary of risk minimization
• Risk minimization is a general probabilistic retrieval framework
  • Retrieval as a decision problem (= risk minimization)
  • Separate/flexible language models for queries and docs
• Advantages
  • A unified framework for existing models
  • Automatic parameter tuning due to LMs
  • Allows for modeling complex retrieval tasks
• Lots of potential for exploring LMs…

CS@UVa

Page 29: Language Models

CS 6501: Information Retrieval 29

What you should know

• EM algorithm
• Risk minimization framework

CS@UVa