Language Models Hongning Wang CS@UVa


Page 1: Language Models

Language Models
Hongning Wang

CS@UVa

Page 2: Language Models

CS 6501: Information Retrieval 2

Two-stage smoothing [Zhai & Lafferty 02]

$$P(w \mid d) = (1-\lambda)\,\frac{c(w,d) + \mu\, p(w \mid C)}{|d| + \mu} + \lambda\, p(w \mid U)$$

Stage-1: Dirichlet prior (Bayesian) smoothing with the collection LM $p(w \mid C)$; this explains unseen words.

Stage-2: 2-component mixture with the user background model $p(w \mid U)$; this explains noise in the query. $p(w \mid U)$ can be approximated by $p(w \mid C)$.

CS@UVa
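Below is a minimal Python sketch of this two-stage estimate; the function and argument names, the dictionary-based LMs, and the default values of mu and lam are illustrative assumptions rather than anything from the slides. As noted above, p(w|U) can be approximated by p(w|C), i.e., the same dictionary may be passed for both.

```python
from collections import Counter

def two_stage_prob(w, doc_tokens, collection_lm, background_lm, mu=2000.0, lam=0.1):
    """P(w|d) = (1-lam) * (c(w,d) + mu*p(w|C)) / (|d| + mu) + lam * p(w|U)."""
    counts = Counter(doc_tokens)
    # Stage-1: Dirichlet-prior (Bayesian) smoothing with the collection LM p(w|C)
    dirichlet = (counts[w] + mu * collection_lm.get(w, 0.0)) / (len(doc_tokens) + mu)
    # Stage-2: 2-component mixture with the user background model p(w|U)
    return (1.0 - lam) * dirichlet + lam * background_lm.get(w, 0.0)
```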

Page 3: Language Models

CS 6501: Information Retrieval 3

Estimating λ using the EM algorithm [Zhai & Lafferty 02]

Query Q = q_1 ... q_m. Consider query generation as a mixture over all the relevant documents, where $\pi_j$ is the probability of choosing the language model of document $d_j$:

$$p(Q \mid \lambda, U) = \prod_{i=1}^{m} \sum_{j=1}^{N} \pi_j \big[(1-\lambda)\, p(q_i \mid d_j) + \lambda\, p(q_i \mid U)\big], \qquad \hat{\lambda} = \arg\max_{\lambda}\, p(Q \mid \lambda, U)$$

Stage-1: each document LM is Dirichlet-smoothed with the $\hat{\mu}$ estimated in stage-1,

$$p(q_i \mid d_j) = \frac{c(q_i, d_j) + \hat{\mu}\, p(q_i \mid C)}{|d_j| + \hat{\mu}}$$

Stage-2: each smoothed document LM $p(w \mid d_j)$ is mixed with the user background model as $(1-\lambda)\, p(w \mid d_j) + \lambda\, p(w \mid U)$, for $j = 1, \ldots, N$, and the Expectation-Maximization (EM) algorithm is used for estimating $\pi$ and $\lambda$.

CS@UVa
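A simplified Python sketch of the EM update for λ, holding the document weights π_j fixed (uniform by default) and the stage-1 smoothed document LMs fixed; the function name, the dictionary-based LMs, and the floor value 1e-12 are illustrative assumptions.

```python
def em_lambda(query, doc_lms, background_lm, priors=None, lam=0.5, n_iters=50):
    """Estimate the stage-2 mixing weight lambda by EM.

    query: list of query terms q_1..q_m
    doc_lms: list of dicts, doc_lms[j][w] = stage-1 smoothed p(w|d_j)
    background_lm: dict, background_lm[w] = p(w|U)
    """
    N = len(doc_lms)
    priors = priors or [1.0 / N] * N
    for _ in range(n_iters):
        # E-step: posterior probability that each query term was drawn from p(.|U)
        resp = []
        for q in query:
            doc_part = (1.0 - lam) * sum(pi * lm.get(q, 1e-12)
                                         for pi, lm in zip(priors, doc_lms))
            bg_part = lam * background_lm.get(q, 1e-12)
            resp.append(bg_part / (doc_part + bg_part))
        # M-step: lambda becomes the average background responsibility
        lam = sum(resp) / len(query)
    return lam
```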

Page 4: Language Models

CS 6501: Information Retrieval 4

Introduction to EM

• Parameter estimation
  • All data is observable
    • Maximum likelihood method
    • Optimize the analytic form of the likelihood $L(\theta) = \log p(X \mid \theta)$
  • Missing/unobservable data
    • Data: X (observed) + H (hidden)
    • Likelihood: $L(\theta) = \log \int p(X, H \mid \theta)\, dH$, which is intractable in most cases
    • Approximate it!

Example of hidden data: the sentence "We have missing date." (is the intended word "date" or "data"?)

CS@UVa

Page 5: Language Models

CS 6501: Information Retrieval 5

Background knowledge

• Jensen's inequality
  • For any convex function $f$ and positive weights $\lambda_i$ with $\sum_i \lambda_i = 1$:

$$f\Big(\sum_i \lambda_i x_i\Big) \le \sum_i \lambda_i f(x_i)$$

CS@UVa
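Since log is concave rather than convex, the direction of the inequality reverses, giving $f(E[x]) \ge E[f(x)]$; this is the form used to derive the EM lower bound on the next slide. A brief note of that step, written with the proposal distribution $q(H)$ introduced there:

```latex
% Jensen's inequality for the concave \log, with weights q(H):
\[
\log p(X \mid \theta)
  = \log \int q(H)\,\frac{p(X,H \mid \theta)}{q(H)}\,dH
  \;\ge\; \int q(H)\,\log \frac{p(X,H \mid \theta)}{q(H)}\,dH .
\]
```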

Page 6: Language Models

CS 6501: Information Retrieval 6

Expectation Maximization

• Maximize the data likelihood function by pushing up the lower bound

$$L(\theta) = \log p(X \mid \theta) = \log \int q(H)\,\frac{p(X, H \mid \theta)}{q(H)}\, dH \;\ge\; \int q(H) \log p(X, H \mid \theta)\, dH - \int q(H) \log q(H)\, dH$$

(by Jensen's inequality, $f(E[x]) \ge E[f(x)]$ for the concave $f = \log$)

Lower bound: easier to compute, many good properties!

Components we need to tune when optimizing $L(\theta)$: $q(H)$ and $\theta$, where $q(H)$ is a proposal distribution for the hidden variable $H$.

CS@UVa
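A tiny numeric sanity check of the lower bound for a single observation from a two-component mixture; the probabilities below are made-up illustrative numbers, not from the slides. The bound never exceeds $\log p(x \mid \theta)$, and it is tight when $q$ equals the posterior $p(H \mid x, \theta)$:

```python
import math

p_h = [0.3, 0.7]                    # p(H|theta): mixture weights over the hidden component
p_x_given_h = [0.05, 0.20]          # p(x|H, theta) for one observed x
p_x = sum(ph * px for ph, px in zip(p_h, p_x_given_h))        # p(x|theta)

posterior = [ph * px / p_x for ph, px in zip(p_h, p_x_given_h)]
for q in ([0.5, 0.5], [0.1, 0.9], posterior):
    # lower bound = sum_H q(H) log p(x,H|theta) - sum_H q(H) log q(H)
    bound = sum(qi * math.log(ph * px) for qi, ph, px in zip(q, p_h, p_x_given_h)) \
            - sum(qi * math.log(qi) for qi in q)
    print(round(bound, 4), "<=", round(math.log(p_x), 4))
```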

Page 7: Language Models

CS 6501: Information Retrieval 7

Intuitive understanding of EM

[Figure: the data likelihood $p(X \mid \theta)$ and its lower bound; the lower bound is easier to optimize and is guaranteed to improve the data likelihood.]

CS@UVa

Page 8: Language Models

CS 6501: Information Retrieval 8

Expectation Maximization (cont)
• Optimize the lower bound with respect to $q(H)$

$$L(\theta) \ge \int q(H) \log p(X, H \mid \theta)\, dH - \int q(H) \log q(H)\, dH$$
$$= \int q(H) \big[\log p(H \mid X, \theta) + \log p(X \mid \theta)\big]\, dH - \int q(H) \log q(H)\, dH$$
$$= \int q(H) \log \frac{p(H \mid X, \theta)}{q(H)}\, dH + \log p(X \mid \theta)$$

The first term is the negative KL-divergence between $q(H)$ and $p(H \mid X, \theta)$; the second term is constant with respect to $q(H)$.

CS@UVa

Page 9: Language Models

CS 6501: Information Retrieval 9

Expectation Maximization (cont)
• Optimize the lower bound with respect to $q(H)$

$$L(\theta) \ge -KL\big(q(H)\,\|\,p(H \mid X, \theta)\big) + L(\theta)$$

• The KL-divergence is non-negative, and equals zero iff $q(H) = p(H \mid X, \theta)$
• A step further: when $q(H) = p(H \mid X, \theta)$, we get equality, i.e., the lower bound is tight!
• Other choices of $q(H)$ cannot lead to this tight bound, but might reduce computational complexity
• Note: the calculation of $p(H \mid X, \theta)$ is based on the current $\theta$

CS@UVa

Page 10: Language Models

CS 6501: Information Retrieval 10

Expectation Maximization (cont)
• Optimize the lower bound with respect to $q(H)$
• Optimal solution: $q(H) = p(H \mid X, \theta^t)$, the posterior distribution of $H$ given the current model $\theta^t$

CS@UVa

Page 11: Language Models

CS 6501: Information Retrieval 11

Expectation Maximization (cont)
• Optimize the lower bound with respect to $\theta$

$$\theta^{t+1} = \arg\max_\theta \int p(H \mid X, \theta^t) \log p(X, H \mid \theta)\, dH - \int p(H \mid X, \theta^t) \log p(H \mid X, \theta^t)\, dH$$

The second term is constant w.r.t. $\theta$, so

$$\theta^{t+1} = \arg\max_\theta\; E_{H \mid X, \theta^t}\big[\log p(X, H \mid \theta)\big]$$

the expectation of the complete data likelihood.

CS@UVa

Page 12: Language Models

CS 6501: Information Retrieval 12

Expectation Maximization

• EM tries to iteratively maximize the likelihood
• "Complete" likelihood: $L_c(\theta) = \log p(X, H \mid \theta)$
• Starting from an initial guess $\theta^{(0)}$,
  1. E-step: compute the expectation of the complete likelihood
$$Q(\theta; \theta^t) = E_{H \mid X, \theta^t}\big[L_c(\theta)\big] = \int p(H \mid X, \theta^t)\, \log p(X, H \mid \theta)\, dH$$
  2. M-step: compute $\theta^{t+1}$ by maximizing the Q-function (the key step!)
$$\theta^{t+1} = \arg\max_\theta\, Q(\theta; \theta^t)$$

CS@UVa
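To make the E-step/M-step recipe concrete, here is a minimal Python sketch of EM on a simple two-component word mixture: each observed word token is generated either by a known background LM (with fixed weight lam_b) or by an unknown topic LM theta that we want to estimate. The setup, names, and default values are illustrative assumptions, not a model from the slides.

```python
from collections import Counter

def em_topic_lm(tokens, background_lm, lam_b=0.5, n_iters=50):
    counts = Counter(tokens)
    vocab = list(counts)
    theta = {w: 1.0 / len(vocab) for w in vocab}       # initial guess theta^(0)
    for _ in range(n_iters):
        # E-step: posterior that each word token came from the topic LM (the hidden H)
        resp = {}
        for w in vocab:
            topic = (1.0 - lam_b) * theta[w]
            bg = lam_b * background_lm.get(w, 1e-12)
            resp[w] = topic / (topic + bg)
        # M-step: maximize the expected complete-data likelihood (re-normalized counts)
        weighted = {w: counts[w] * resp[w] for w in vocab}
        total = sum(weighted.values())
        theta = {w: weighted[w] / total for w in vocab}
    return theta
```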

Page 13: Language Models

13

Intuitive understanding of EM

[Figure: the data likelihood $p(X \mid \theta)$ and its lower bound (the Q function); maximizing the lower bound built at the current guess $\theta^t$ yields the next guess $\theta^{t+1}$.]

E-step = computing the lower bound; M-step = maximizing the lower bound.

Example: S = "We have missing date."
E-step: $p(date \mid S, \theta^t) = 0.1$, $p(data \mid S, \theta^t) = 0.9$
M-step: $p(data \mid \theta^{t+1}) \propto 0.9 + \sum_{observed} 1$

Page 14: Language Models

CS 6501: Information Retrieval 14

Convergence guarantee

• Proof of EM

$$\log p(X \mid \theta) = \log p(H, X \mid \theta) - \log p(H \mid X, \theta)$$

Taking the expectation of both sides with respect to $p(H \mid X, \theta^t)$:

$$\log p(X \mid \theta) = \int p(H \mid X, \theta^t) \log p(H, X \mid \theta)\, dH - \int p(H \mid X, \theta^t) \log p(H \mid X, \theta)\, dH = Q(\theta; \theta^t) + H(\theta; \theta^t)$$

where the second term $H(\theta; \theta^t)$ is a cross-entropy. The change of the log data likelihood between EM iterations is then

$$\log p(X \mid \theta) - \log p(X \mid \theta^t) = Q(\theta; \theta^t) + H(\theta; \theta^t) - Q(\theta^t; \theta^t) - H(\theta^t; \theta^t)$$

By Jensen's inequality, $H(\theta; \theta^t) \ge H(\theta^t; \theta^t)$, which means

$$\log p(X \mid \theta) - \log p(X \mid \theta^t) \ge Q(\theta; \theta^t) - Q(\theta^t; \theta^t) \ge 0$$

The M-step guarantees the last inequality.

CS@UVa

Page 15: Language Models

CS 6501: Information Retrieval 15

What is not guaranteed

• A global optimum is not guaranteed!
  • The likelihood $L(\theta)$ is non-convex in most cases
  • EM boils down to a greedy algorithm
    • Alternating ascent
• Generalized EM
  • E-step: compute the expectation of the complete likelihood
  • M-step: choose a $\theta$ that improves $Q(\theta; \theta^t)$, not necessarily maximizing it

CS@UVa

Page 16: Language Models

CS 6501: Information Retrieval 16

Estimating λ using Mixture Model [Zhai & Lafferty 02]

Query Q = q_1 ... q_m. The origin of each query term is unknown, i.e., does it come from the user background model or from the topic (document) model?

$$p(Q \mid \lambda, U) = \prod_{i=1}^{m} \sum_{j=1}^{N} \pi_j \big[(1-\lambda)\, p(q_i \mid d_j) + \lambda\, p(q_i \mid U)\big], \qquad \hat{\lambda} = \arg\max_{\lambda}\, p(Q \mid \lambda, U)$$

As before, stage-1 supplies each Dirichlet-smoothed document LM, with $\hat{\mu}$ estimated in stage-1,

$$p(q_i \mid d_j) = \frac{c(q_i, d_j) + \hat{\mu}\, p(q_i \mid C)}{|d_j| + \hat{\mu}}$$

and stage-2 mixes each document LM with the user background model, $(1-\lambda)\, p(w \mid d_j) + \lambda\, p(w \mid U)$; the Expectation-Maximization (EM) algorithm is used for estimating $\pi$ and $\lambda$.

CS@UVa

Page 17: Language Models

CS 6501: Information Retrieval 17

Variants of basic LM approach
• Different smoothing strategies
  • Hidden Markov Models (essentially linear interpolation) [Miller et al. 99]
  • Smoothing with an IDF-like reference model [Hiemstra & Kraaij 99]
  • Performance tends to be similar to the basic LM approach
  • Many other possibilities for smoothing [Chen & Goodman 98]
• Different priors
  • Link information as prior leads to significant improvement of Web entry page retrieval performance [Kraaij et al. 02]
  • Time as prior [Li & Croft 03]
  • PageRank as prior [Kurland & Lee 05]
• Passage retrieval [Liu & Croft 02]

CS@UVa

Page 18: Language Models

CS 6501: Information Retrieval 18

Improving language models

• Capturing limited dependencies
  • Bigrams/Trigrams [Song & Croft 99]
  • Grammatical dependency [Nallapati & Allan 02, Srikanth & Srihari 03, Gao et al. 04]
  • Generally insignificant improvement as compared with other extensions such as feedback
• Full Bayesian query likelihood [Zaragoza et al. 03]
  • Performance similar to the basic LM approach
• Translation model for p(Q|D,R) [Berger & Lafferty 99, Jin et al. 02, Cao et al. 05]
  • Addresses polysemy and synonyms
  • Improves over the basic LM methods, but computationally expensive
• Cluster-based smoothing/scoring [Liu & Croft 04, Kurland & Lee 04, Tao et al. 06]
  • Improves over the basic LM, but computationally expensive
• Parsimonious LMs [Hiemstra et al. 04]
  • Uses a mixture model to "factor out" non-discriminative words

CS@UVa

Page 19: Language Models

CS 6501: Information Retrieval 19

A unified framework for IR: Risk Minimization
• Long-standing IR challenges
  • Improve IR theory
    • Develop theoretically sound and empirically effective models
    • Go beyond the limited traditional notion of relevance (independent, topical relevance)
  • Improve IR practice
    • Optimize retrieval parameters automatically
• Language models are promising tools…
  • How can we systematically exploit LMs in IR?
  • Can LMs offer anything hard/impossible to achieve in traditional IR?

CS@UVa

Page 20: Language Models

CS 6501: Information Retrieval 20

Idea 1: Retrieval as decision-making (a more general notion of relevance)

Given a query,
- Which documents should be selected? (D)
- How should these docs be presented to the user? (π)
Choose: (D, π)

Possible presentations: a ranked list? an unordered subset? clustering?

CS@UVa

Page 21: Language Models

CS 6501: Information Retrieval 21

Idea 2: Systematic language modeling

[Figure: Documents are mapped to document language models (DOC MODELING); the query is mapped to a query language model (QUERY MODELING); the user determines the loss function (USER MODELING); together these feed the retrieval decision: ?]

CS@UVa

Page 22: Language Models

CS 6501: Information Retrieval 22

Generative model of document & query [Lafferty & Zhai 01b]

[Figure: the user U generates a query model $\theta_Q \sim p(\theta_Q \mid U)$, which generates the query $q \sim p(q \mid \theta_Q, U)$; the source S generates a document model $\theta_D \sim p(\theta_D \mid S)$, which generates the document $d \sim p(d \mid \theta_D, S)$; relevance $R \sim p(R \mid \theta_Q, \theta_D)$. The query and document are observed, U and S are partially observed, and $\theta_Q$, $\theta_D$, and R are inferred.]

CS@UVa

Page 23: Language Models

CS 6501: Information Retrieval 23

Applying Bayesian Decision Theory [Lafferty & Zhai 01b, Zhai 02, Zhai & Lafferty 06]

[Figure: given the observed query q (from user U) and the document collection C (from source S), the system chooses among (D_1, π_1), (D_2, π_2), ..., (D_n, π_n), each choice with an associated loss L.]

$$(D^*, \pi^*) = \arg\min_{D, \pi} \int_{\Theta} L(D, \pi, \theta)\, p(\theta \mid q, U, C, S)\, d\theta$$

The loss depends on the hidden state $\theta$ and the observed q, U, C, S; the integral is the Bayes risk for choice (D, π): RISK MINIMIZATION.

CS@UVa

Page 24: Language Models

CS 6501: Information Retrieval 24

Special cases

• Set-based models (choose D)
• Ranking models (choose π)
  • Independent loss
    • Relevance-based loss
    • Distance-based loss
  • Dependent loss
    • Maximum Margin Relevance loss
    • Maximal Diverse Relevance loss

Special cases covered include the Boolean model, the vector-space model, the two-stage LM, the KL-divergence model, the probabilistic relevance model, Generative Relevance Theory, and the subtopic retrieval model.

CS@UVa

Page 25: Language Models

CS 6501: Information Retrieval 25

Optimal ranking for independent loss

Decision space = {rankings}; sequential browsing; independent loss.

With an independent loss, independent risk = independent scoring (the "risk ranking principle" [Zhai 02]): the optimal ranking presents documents in ascending order of the expected loss

$$r(d_k \mid q, U, C, S) = \int l(\theta_k)\, p(\theta_k \mid q, U, C, S, d_k)\, d\theta_k$$

where the weight $s_i$ in the derivation is the probability that the user would stop reading after seeing the top $i$ documents.

CS@UVa
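A minimal Python sketch of scoring under an independent, relevance-based loss: each document is scored by its expected loss under a posterior probability of relevance, and documents are ranked in ascending order of that risk. The function p_rel, the binary relevance model, and the loss values are illustrative assumptions, not part of the framework's specification.

```python
def rank_by_risk(query, docs, p_rel, loss_rel=0.0, loss_nonrel=1.0):
    """Rank documents by expected loss under an independent loss (a sketch).

    p_rel(query, d) is assumed to return P(R=1 | query, d), e.g. derived from a
    (two-stage smoothed) query-likelihood score.
    """
    def risk(d):
        p = p_rel(query, d)
        return p * loss_rel + (1.0 - p) * loss_nonrel   # expected loss of presenting d
    return sorted(docs, key=risk)                       # ascending expected loss
```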

Page 26: Language Models

CS 6501: Information Retrieval 26

Automatic parameter tuning
• Retrieval parameters are needed to
  • model different user preferences
  • customize a retrieval model to specific queries and documents
• Retrieval parameters in traditional models
  • EXTERNAL to the model, hard to interpret
  • Parameters are introduced heuristically to implement "intuition"
  • No principles to quantify them; they must be set empirically through many experiments
  • Still no guarantee for new queries/documents
• Language models make it possible to estimate parameters…

CS@UVa

Page 27: Language Models

CS 6501: Information Retrieval 27

Parameter setting in risk minimization

[Figure: the query is mapped to a query language model, whose query model parameters are estimated; the documents are mapped to document language models, whose doc model parameters are estimated; the user determines the loss function, whose user model parameters are set.]

CS@UVa

Page 28: Language Models

CS 6501: Information Retrieval 28

Summary of risk minimization
• Risk minimization is a general probabilistic retrieval framework
  • Retrieval as a decision problem (= risk minimization)
  • Separate/flexible language models for queries and docs
• Advantages
  • A unified framework for existing models
  • Automatic parameter tuning due to LMs
  • Allows for modeling complex retrieval tasks
• Lots of potential for exploring LMs…

CS@UVa

Page 29: Language Models

CS 6501: Information Retrieval 29

What you should know

• EM algorithm
• Risk minimization framework

CS@UVa