Language Models Hongning Wang CS@UVa


Page 1: Language Models Hongning Wang CS@UVa. Recap: document generation model CS@UVaCS 4501: Information Retrieval Model of relevant docs for Q Model of non-relevant

Language Models

Hongning Wang, CS@UVa

Page 2:

CS 4501: Information Retrieval 2

Recap: document generation model

O(R=1|Q,D) = P(Q,D|R=1) / P(Q,D|R=0)
           = [P(D|Q,R=1) P(Q|R=1)] / [P(D|Q,R=0) P(Q|R=0)]
           ∝ P(D|Q,R=1) / P(D|Q,R=0)

Model of relevant docs for Q vs. model of non-relevant docs for Q

Assume independent attributes A1…Ak (why?). Let D = d1…dk, where di ∈ {0,1} is the value of attribute Ai (similarly Q = q1…qk).

O(R=1|Q,D) ∝ ∏_{i=1..k} P(Ai=di|Q,R=1) / P(Ai=di|Q,R=0)
           = ∏_{i: di=1} P(Ai=1|Q,R=1)/P(Ai=1|Q,R=0) × ∏_{i: di=0} P(Ai=0|Q,R=1)/P(Ai=0|Q,R=0)

(first product: terms that occur in the doc; second product: terms that do not occur in the doc)

document            relevant (R=1)    nonrelevant (R=0)
term present, Ai=1  pi                ui
term absent,  Ai=0  1-pi              1-ui

Ignored for ranking

Page 3:

CS 4501: Information Retrieval 3

Recap: document generation model

O(R=1|Q,D) ∝ ∏_{i: di=1} pi/ui × ∏_{i: di=0} (1-pi)/(1-ui)
           = ∏_{i: di=1, qi=1} pi/ui × ∏_{i: di=0, qi=1} (1-pi)/(1-ui)        (using pi = ui when qi = 0)
           = ∏_{i: di=qi=1} [pi(1-ui)] / [ui(1-pi)] × ∏_{i: qi=1} (1-pi)/(1-ui)

Terms occur in doc vs. terms that do not occur in doc:

document            relevant (R=1)    nonrelevant (R=0)
term present, Ai=1  pi                ui
term absent,  Ai=0  1-pi              1-ui

Assumption: terms not occurring in the query are equally likely to occur in relevant and nonrelevant documents, i.e., pi = ui

Important tricks

Page 4:

CS 4501: Information Retrieval 4

Recap: maximum likelihood vs. Bayesian

• Maximum likelihood estimation
  – "Best" means "data likelihood reaches maximum"
  – Issue: small sample size
• Bayesian estimation
  – "Best" means being consistent with our "prior" knowledge and explaining the data well
  – A.k.a. maximum a posteriori (MAP) estimation
  – Issue: how to define the prior?

θ̂ = argmax_θ P(X|θ)

θ̂ = argmax_θ P(θ|X) = argmax_θ P(X|θ) P(θ)

ML: Frequentist’s point of view

MAP: Bayesian’s point of view
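The two estimators can be contrasted on a tiny Bernoulli example. A minimal sketch in Python; the coin data and the Beta(5, 5) prior are hypothetical, chosen only to show how the prior's pseudo counts pull the MAP estimate toward the prior belief:

```python
# Hypothetical data: 3 heads in 4 flips of a possibly biased coin.
heads, flips = 3, 4

# ML: the theta maximizing P(X|theta) under a Bernoulli model is the
# relative frequency -- it can overfit a small sample badly.
theta_ml = heads / flips

# MAP with a Beta(a, b) prior: the prior acts like (a-1) extra heads and
# (b-1) extra tails of "pseudo data" (a = b = 5 encodes a roughly-fair
# prior belief; both values are illustrative).
a, b = 5, 5
theta_map = (heads + a - 1) / (flips + a + b - 2)
```

Here theta_ml = 0.75, while theta_map = 7/12 ≈ 0.58: the prior "data" moderates the small-sample estimate.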

Page 5:

CS 4501: Information Retrieval 5

Recap: Robertson-Sparck Jones model (Robertson & Sparck Jones, 76)

Two parameters for each term Ai:
  pi = P(Ai=1|Q,R=1): prob. that term Ai occurs in a relevant doc
  ui = P(Ai=1|Q,R=0): prob. that term Ai occurs in a non-relevant doc

log O(R=1|Q,D) ∝rank Σ_{i: di=qi=1} log [pi(1-ui)] / [ui(1-pi)]   (RSJ model)

How to estimate these parameters? Suppose we have relevance judgments:

p̂i = (#(rel. docs with Ai) + 0.5) / (#(rel. docs) + 1)
ûi = (#(nonrel. docs with Ai) + 0.5) / (#(nonrel. docs) + 1)

• “+0.5” and “+1” can be justified by Bayesian estimation as priors

Per-query estimation!
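The estimators above plug directly into the RSJ score. A minimal sketch, assuming documents are represented as sets of terms and per-query relevance judgments are available; all names and data here are illustrative:

```python
import math

def rsj_params(rel_docs, nonrel_docs, term):
    """Smoothed estimates of p_i and u_i from relevance judgments,
    using the +0.5 / +1 priors from the slide."""
    p = (sum(term in d for d in rel_docs) + 0.5) / (len(rel_docs) + 1)
    u = (sum(term in d for d in nonrel_docs) + 0.5) / (len(nonrel_docs) + 1)
    return p, u

def rsj_score(doc, query_terms, rel_docs, nonrel_docs):
    """Rank-equivalent RSJ score: sum over query terms present in the
    document of log [p(1-u)] / [u(1-p)]."""
    score = 0.0
    for t in query_terms:
        if t in doc:
            p, u = rsj_params(rel_docs, nonrel_docs, t)
            score += math.log(p * (1 - u) / (u * (1 - p)))
    return score
```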

Page 6:

CS 4501: Information Retrieval 6

Recap: the BM25 formula

• A closer look
  – b is usually set to [0.75, 1.2]
  – k1 is usually set to [1.2, 2.0]
  – k2 is usually set to (0, 1000]

rel(q,D) = Σ_{i∈q} IDF(qi) × [tfi (k1+1)] / [tfi + k1(1 − b + b·|D|/avg|D|)] × [qtfi (k2+1)] / [qtfi + k2]

TF-IDF component for document TF component for query

Vector space model with TF-IDF schema!
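A sketch of the formula in Python. The slide does not fix an IDF definition, so a common variant, log((N − n + 0.5)/(n + 0.5) + 1), is assumed here; the parameter defaults follow the ranges above, and the toy data is illustrative:

```python
import math

def bm25_score(query, doc, docs, k1=1.2, b=0.75, k2=100):
    """Sketch of the BM25 formula above; docs (the whole collection,
    as term lists) supplies IDF and the average document length."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    score = 0.0
    for term in set(query):
        n = sum(term in d for d in docs)               # document frequency
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)  # a common IDF variant
        tf, qtf = doc.count(term), query.count(term)
        tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        qtf_part = qtf * (k2 + 1) / (qtf + k2)
        score += idf * tf_part * qtf_part
    return score
```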

Page 7:

CS 4501: Information Retrieval 7

Notion of relevance

• Relevance
  – Similarity of representations (Rep(q), Rep(d)) — different rep & similarity
    • Vector space model (Salton et al., 75)
    • Prob. distr. model (Wong & Yao, 89)
  – Probability of relevance P(r=1|q,d), r ∈ {0,1}
    • Generative models
      – Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
      – Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
    • Regression model (Fox, 83)
  – Probabilistic inference P(d→q) or P(q→d) — different inference systems
    • Prob. concept space model (Wong & Yao, 95)
    • Inference network model (Turtle & Croft, 91)

Page 8:

CS 4501: Information Retrieval 8

What is a statistical LM?

• A model specifying a probability distribution over word sequences
  – p("Today is Wednesday") ≈ 0.001
  – p("Today Wednesday is") ≈ 0.0000000000001
  – p("The eigenvalue is positive") ≈ 0.00001

• It can be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model


Page 9:

CS 4501: Information Retrieval 9

Why is an LM useful?

• Provides a principled way to quantify the uncertainties associated with natural language
• Allows us to answer questions like:
  – Given that we see "John" and "feels", how likely will we see "happy" as opposed to "habit" as the next word? (speech recognition)
  – Given that we observe "baseball" three times and "game" once in a news article, how likely is it about "sports"? (text categorization, information retrieval)
  – Given that a user is interested in sports news, how likely would the user use "baseball" in a query? (information retrieval)

Page 10:

CS 4501: Information Retrieval 10

Source-Channel framework [Shannon 48]

Source → Transmitter (encoder) → Noisy channel → Receiver (decoder) → Destination
  X ~ P(X)          →         Y ~ P(Y|X)         →         X'

X̂ = argmax_X p(X|Y) = argmax_X p(Y|X) p(X)   (Bayes rule)

When X is text, p(X) is a language model.

Many examples:
• Speech recognition: X = word sequence, Y = speech signal
• Machine translation: X = English sentence, Y = Chinese sentence
• OCR error correction: X = correct word, Y = erroneous word
• Information retrieval: X = document, Y = query
• Summarization: X = summary, Y = document

Page 11:

CS 4501: Information Retrieval 11

Language model for text

• Probability distribution over word sequences
  – Chain rule: from conditional probability to joint probability,
    p(w1 w2 … wn) = p(w1) p(w2|w1) p(w3|w1,w2) … p(wn|w1,…,wn-1)
  – Complexity: O(V^n), where n is the maximum document length
• A rough estimate — how large is this?
  – Average English sentence length is 14.3 words
  – 475,000 main headwords in Webster's Third New International Dictionary
  – Storing 475000^14 probabilities at 8 bytes each, divided by (1024)^4 bytes/TB ≈ 3.38 × 10^66 TB

We need independence assumptions!

Page 12:

CS 4501: Information Retrieval 12

Unigram language model

• Generate a piece of text by generating each word independently
  – p(w1 w2 … wn) = p(w1) p(w2) … p(wn)
• Essentially a multinomial distribution over the vocabulary

[Bar chart: "A Unigram Language Model" — p(w) for vocabulary words w1…w25, with probabilities up to about 0.12]

The simplest and most popular choice!

Page 13:

CS 4501: Information Retrieval 13

More sophisticated LMs

• N-gram language models
  – In general, p(w1 w2 … wn) = p(w1) p(w2|w1) … p(wn|w1,…,wn-1)
  – N-gram: condition only on the past N-1 words
  – E.g., bigram: p(w1 w2 … wn) = p(w1) p(w2|w1) p(w3|w2) … p(wn|wn-1)
• Remote-dependence language models (e.g., maximum entropy model)
• Structured language models (e.g., probabilistic context-free grammar)

Page 14:

CS 4501: Information Retrieval 14

Why just unigram models?

• Difficulty in moving toward more complex models
  – They involve more parameters, so they need more data to estimate
  – They increase the computational complexity significantly, both in time and space
• Capturing word order or structure may not add much value for "topical inference"
• But using more sophisticated models can still be expected to improve performance...

Page 15:

CS 4501: Information Retrieval 15

Generative view of text documents: a (unigram) language model p(w|θ) is sampled to produce documents.

Topic 1: Text mining — text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, …
Topic 2: Health — text 0.01, food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …

Sampling from each model yields, respectively, a text-mining document and a food-nutrition document.

Page 16:

CS 4501: Information Retrieval 16

Sampling with replacement

• Pick a random shape, then put it back in the bag


Page 17:

CS 4501: Information Retrieval 17

How to generate a text document from an N-gram language model?

• Sample from a discrete distribution
  – Assume K outcomes x1, …, xK in the event space
  1. Divide the interval [0,1] into K intervals according to the probabilities of the outcomes
  2. Generate a random number r uniformly between 0 and 1
  3. Return the outcome xi whose interval r falls into

[Pie chart: "A Unigram Language Model" over the words apple, arm, banana, bike, bird, book, chin, clam, class]
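The three steps can be sketched as inverse-CDF sampling in Python; the toy unigram model below is illustrative:

```python
import random

def sample(distribution, rng=random):
    """Steps 1-3: partition [0,1] by the outcome probabilities, draw a
    uniform r, and return the outcome whose interval contains r."""
    r = rng.random()
    cumulative = 0.0
    for outcome, prob in distribution.items():
        cumulative += prob
        if r < cumulative:
            return outcome
    return outcome  # guard against floating-point round-off

def generate(distribution, length, rng=random):
    """A 'document' from a unigram model: independent draws per word."""
    return [sample(distribution, rng) for _ in range(length)]
```

Over many draws, each word's empirical frequency approaches its model probability.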

Page 18:

CS 4501: Information Retrieval 18

Generating text from language models


Under a unigram language model:

Page 19:

CS 4501: Information Retrieval 19

Generating text from language models


Under a unigram language model: any reordering of the same words has the same likelihood!

Page 20:

CS 4501: Information Retrieval 20

N-gram language models will help

• Unigram
  – Months the my and issue of year foreign new exchange's september were recession exchange new endorsed a q acquire to six executives.
• Bigram
  – Last December through the way to preserve the Hudson corporation N.B.E.C. Taylor would seem to complete the major central planners one point five percent of U.S.E. has already told M.X. corporation of living on information such as more frequently fishing to keep her.
• Trigram
  – They also point to ninety nine point six billon dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions.

Generated from language models of New York Times

Page 21:

CS 4501: Information Retrieval 21

Turing test: generating Shakespeare

[Four generated text samples, labeled A–D]

SCIgen - An Automatic CS Paper Generator

Page 22:

CS 4501: Information Retrieval 22

Recap: what is a statistical LM?

• A model specifying a probability distribution over word sequences
  – p("Today is Wednesday") ≈ 0.001
  – p("Today Wednesday is") ≈ 0.0000000000001
  – p("The eigenvalue is positive") ≈ 0.00001

• It can be regarded as a probabilistic mechanism for "generating" text, thus also called a "generative" model

Page 23:

CS 4501: Information Retrieval 23

Recap: Source-Channel framework [Shannon 48]

Source → Transmitter (encoder) → Noisy channel → Receiver (decoder) → Destination
  X ~ P(X)          →         Y ~ P(Y|X)         →         X'

X̂ = argmax_X p(X|Y) = argmax_X p(Y|X) p(X)   (Bayes rule)

When X is text, p(X) is a language model.

Many examples:
• Speech recognition: X = word sequence, Y = speech signal
• Machine translation: X = English sentence, Y = Chinese sentence
• OCR error correction: X = correct word, Y = erroneous word
• Information retrieval: X = document, Y = query
• Summarization: X = summary, Y = document

Page 24:

CS 4501: Information Retrieval 24

Recap: language model for text

• Probability distribution over word sequences
  – Chain rule: from conditional probability to joint probability,
    p(w1 w2 … wn) = p(w1) p(w2|w1) p(w3|w1,w2) … p(wn|w1,…,wn-1)
  – Complexity: O(V^n), where n is the maximum document length
• A rough estimate — how large is this?
  – Average English sentence length is 14.3 words
  – 475,000 main headwords in Webster's Third New International Dictionary
  – Storing 475000^14 probabilities at 8 bytes each, divided by (1024)^4 bytes/TB ≈ 3.38 × 10^66 TB

We need independence assumptions!

Page 25:

CS 4501: Information Retrieval 25

Recap: how to generate a text document from an N-gram language model?

• Sample from a discrete distribution
  – Assume K outcomes x1, …, xK in the event space
  1. Divide the interval [0,1] into K intervals according to the probabilities of the outcomes
  2. Generate a random number r uniformly between 0 and 1
  3. Return the outcome xi whose interval r falls into

Page 26:

CS 4501: Information Retrieval 26

Estimation of language models

Unigram language model p(w|θ) = ?
  text ?, mining ?, association ?, database ?, …, query ?, …

Estimated from a document — a "text mining" paper (total #words = 100) with counts:
  text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1

Page 27:

CS 4501: Information Retrieval 27

Sampling with replacement

• Pick a random shape, then put it back in the bag


Page 28:

CS 4501: Information Retrieval 28

Estimation of language models

• Maximum likelihood estimation

Document: a "text mining" paper (total #words = 100)
  text 10       → p(text|θ) = 10/100
  mining 5      → 5/100
  association 3 → 3/100
  database 3    → 3/100
  algorithm 2   → 2/100
  …
  query 1       → 1/100
  efficient 1   → 1/100

Page 29:

CS 4501: Information Retrieval 29

Maximum likelihood estimation

• For N-gram language models
  – p(wi | wi-n+1, …, wi-1) = c(wi-n+1, …, wi) / c(wi-n+1, …, wi-1)
  – For a unigram model: p(w|θd) = c(w,d) / |d|, where |d| is the length of the document (or the total number of words in a corpus)
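MLE for unigram and n-gram models reduces to counting relative frequencies. A minimal sketch (toy data; the c(history, w)/c(history) form is the standard relative-frequency estimate):

```python
from collections import Counter

def mle_unigram(doc):
    """p(w|theta_d) = c(w,d) / |d|: relative frequency in the document."""
    counts, total = Counter(doc), len(doc)
    return {w: c / total for w, c in counts.items()}

def mle_ngram(doc, n):
    """p(w_i | history) = c(history, w_i) / c(history): relative
    frequency of the n-gram given its (n-1)-word history."""
    grams = Counter(tuple(doc[i:i + n]) for i in range(len(doc) - n + 1))
    hists = Counter(tuple(doc[i:i + n - 1]) for i in range(len(doc) - n + 2))
    return {g: c / hists[g[:-1]] for g, c in grams.items()}
```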

Page 30:

CS 4501: Information Retrieval 30

Language models for IR [Ponte & Croft SIGIR'98]

Document language models:
  Text mining paper: text ?, mining ?, association ?, clustering ?, …, food ?, …
  Food nutrition paper: food ?, nutrition ?, healthy ?, diet ?, …

Query: "data mining algorithms"

Which model would most likely have generated this query?

Page 31:

CS 4501: Information Retrieval 31

Ranking docs by query likelihood

Estimate a language model θdi for each document d1, d2, …, dN, then rank the documents by the query likelihoods p(q|d1), p(q|d2), …, p(q|dN).

Justification: PRP — O(R=1|Q,D) ≈ P(Q|D,R=1)

Page 32:

CS 4501: Information Retrieval 32

Justification from PRP (query generation)

O(R=1|Q,D) = P(Q,D|R=1) / P(Q,D|R=0)
           = [P(Q|D,R=1) P(D|R=1)] / [P(Q|D,R=0) P(D|R=0)]

Assume P(Q|D,R=0) = P(Q|R=0):

O(R=1|Q,D) ∝ P(Q|D,R=1) × P(D|R=1)/P(D|R=0)

i.e., query likelihood p(q|θd) × document prior. Assuming a uniform document prior, we have

O(R=1|Q,D) ∝ P(Q|D,R=1)

Page 33:

CS 4501: Information Retrieval 33

Retrieval as language model estimation

• Document ranking based on query likelihood:
  log p(q|d) = Σ_{i=1..n} log p(wi|d), where q = w1 w2 … wn
  – p(wi|d) is the document language model
  – Retrieval problem ⇒ estimation of p(wi|d)
  – Common approach: maximum likelihood estimation (MLE)
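Query-likelihood scoring with pure MLE can be sketched as follows (toy data); note how any unseen query word drives the score to −∞, the zero-probability problem that motivates smoothing later in the deck:

```python
import math
from collections import Counter

def log_query_likelihood(query, doc):
    """log p(q|d) = sum_i log p_ML(w_i|d); -inf if any query word is
    absent from the document (the zero-probability problem of MLE)."""
    counts, total = Counter(doc), len(doc)
    score = 0.0
    for w in query:
        if counts[w] == 0:
            return float("-inf")
        score += math.log(counts[w] / total)
    return score

def rank(query, docs):
    """Indices of docs ordered by query likelihood, best first."""
    return sorted(range(len(docs)),
                  key=lambda i: log_query_likelihood(query, docs[i]),
                  reverse=True)
```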

Page 34:

CS 4501: Information Retrieval 34

Problem with MLE

• Unseen events
  – There are 440K tokens in a larger collection of Yelp reviews, but:
    • Only 30,000 unique words occurred
    • Only 0.04% of all possible bigrams occurred
  – This means any word/N-gram that does not occur in the collection has zero probability with MLE!
  – No future document could then contain those unseen words/N-grams

[Plot: word frequency vs. word rank by frequency in Wikipedia (Nov 27, 2006), following Zipf's law f(k; s, N) ∝ 1/k^s]

Page 35:

CS 4501: Information Retrieval 35

Problem with MLE

• What probability should we give a word that has not been observed in the document?
  – log 0?
• If we want to assign non-zero probabilities to such words, we'll have to discount the probabilities of observed words
• This is so-called "smoothing"

Page 36:

CS 4501: Information Retrieval 36

General idea of smoothing

• All smoothing methods try to:
  1. Discount the probability of words seen in a document
  2. Re-allocate the extra counts such that unseen words will have a non-zero count

Page 37:

CS 4501: Information Retrieval 37

Illustration of language model smoothing

[Plot: P(w|d) over words w, comparing the maximum likelihood estimate p_ML(w) = (count of w) / (count of all words) with the smoothed LM: probability is discounted from the seen words and reassigned so the unseen words get nonzero probabilities]

Page 38:

CS 4501: Information Retrieval 38

Smoothing methods

• Method 1: Additive smoothing
  – Add a constant to the count of each word ("add one", Laplace smoothing):
    p(w|d) = (c(w,d) + 1) / (|d| + |V|)
    where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size
  – Problems?
    • Hint: are all words equally important?
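A minimal sketch of add-one smoothing (toy data; the vocabulary is passed explicitly, since unseen words need entries too):

```python
from collections import Counter

def add_one(doc, vocabulary):
    """Laplace smoothing: p(w|d) = (c(w,d) + 1) / (|d| + |V|).
    Every vocabulary word, seen or not, gets nonzero probability."""
    counts, total = Counter(doc), len(doc)
    denom = total + len(vocabulary)
    return {w: (counts[w] + 1) / denom for w in vocabulary}
```

The smoothed probabilities still sum to 1 over the vocabulary.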

Page 39:

CS 4501: Information Retrieval 39

Add one smoothing for bigrams


Page 40:

CS 4501: Information Retrieval 40

After smoothing

• Giving too much to the unseen events


Page 41:

CS 4501: Information Retrieval 41

Refine the idea of smoothing

• Should all unseen words get equal probabilities?
• We can use a reference model to discriminate unseen words:

  p(w|d) = p_seen(w|d)       if w is seen in d    (discounted ML estimate)
         = α_d p(w|REF)      otherwise            (reference language model)

  where α_d = (1 − Σ_{w seen} p_seen(w|d)) / Σ_{w unseen} p(w|REF)

Page 42:

CS 4501: Information Retrieval 42

Smoothing methods

• Method 2: Absolute discounting
  – Subtract a constant δ from the counts of each word:
    p(w|d) = ( max(c(w;d) − δ, 0) + δ |d|_u p(w|REF) ) / |d|
    where |d|_u is the number of unique words in d
  – Problems?
    • Hint: varied document length?

Page 43:

CS 4501: Information Retrieval 43

Smoothing methods

• Method 3: Linear interpolation, Jelinek-Mercer
  – "Shrink" uniformly toward p(w|REF):
    p(w|d) = (1 − λ) c(w,d)/|d| + λ p(w|REF)
    (the MLE term interpolated with the reference model; λ is a parameter)
  – Problems?
    • Hint: what is missing?
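A sketch of Jelinek-Mercer interpolation, with the reference model passed as a dict over the vocabulary (λ = 0.5 and the data are arbitrary illustrative choices):

```python
from collections import Counter

def jelinek_mercer(doc, ref, lam=0.5):
    """p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|REF).
    ref maps every vocabulary word to its reference probability."""
    counts, total = Counter(doc), len(doc)
    return {w: (1 - lam) * counts[w] / total + lam * p_ref
            for w, p_ref in ref.items()}
```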

Page 44:

CS 4501: Information Retrieval 44

Smoothing methods

• Method 4: Dirichlet prior / Bayesian
  – Assume μ pseudo counts distributed as p(w|REF):
    p(w|d) = (c(w,d) + μ p(w|REF)) / (|d| + μ)
           = |d|/(|d|+μ) · c(w,d)/|d| + μ/(|d|+μ) · p(w|REF)
    (μ is a parameter)
  – Problems?
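A sketch of Dirichlet prior smoothing in the same style (μ = 2000 is only an illustrative default here; the tiny example uses μ = 3 so the arithmetic is easy to follow):

```python
from collections import Counter

def dirichlet_smooth(doc, ref, mu=2000):
    """p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu): equivalent to
    interpolation with a document-length-dependent weight mu/(|d|+mu)
    on the reference model."""
    counts, total = Counter(doc), len(doc)
    return {w: (counts[w] + mu * p_ref) / (total + mu)
            for w, p_ref in ref.items()}
```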

Page 45:

CS 4501: Information Retrieval 45

Dirichlet prior smoothing

• Bayesian estimator
  – Posterior of LM: p(θ|d) ∝ p(d|θ) p(θ) — likelihood of the doc given the model × prior over models
• Conjugate prior
  – Posterior will be in the same form as the prior
  – Prior can be interpreted as "extra"/"pseudo" data
• Dirichlet distribution is a conjugate prior for the multinomial distribution:

  Dir(θ|α1,…,αN) = Γ(α1+…+αN) / [Γ(α1)…Γ(αN)] × ∏_{i=1..N} θi^(αi−1)

  with "extra"/"pseudo" word counts; we set αi = μ p(wi|REF)

Page 46:

CS 4501: Information Retrieval 46

Some background knowledge

• Conjugate prior
  – Posterior distribution is in the same family as the prior
• Dirichlet distribution
  – Continuous
  – Samples from it are the parameters of a multinomial distribution

Conjugate pairs: Gaussian → Gaussian; Beta → Binomial; Dirichlet → Multinomial

Page 47:

CS 4501: Information Retrieval 47

Dirichlet prior smoothing (cont.)

Posterior distribution of parameters:

  p(θ|d) ∝ Dir(θ | c(w1)+α1, …, c(wN)+αN)

Property: if θ ~ Dir(θ|α), then E(θi) = αi / Σ_j αj

The predictive distribution is the same as the mean:

  p(wi|θ̂) = ∫ p(wi|θ) Dir(θ|α) dθ = (c(wi) + αi) / (|d| + Σ_j αj) = (c(wi) + μ p(wi|REF)) / (|d| + μ)

— this is Dirichlet prior smoothing.

Page 48:

CS 4501: Information Retrieval 48

Estimating μ using leave-one-out [Zhai & Lafferty 02]

Leave each word occurrence w1, w2, …, wn out in turn and predict it with the model of the remaining document: P(w1|d−w1), P(w2|d−w2), …, P(wn|d−wn).

Log-likelihood:

  L_{-1}(μ|C) = Σ_{i=1..N} Σ_{w∈V} c(w,di) log[ (c(w,di) − 1 + μ p(w|C)) / (|di| − 1 + μ) ]

Maximum likelihood estimator: μ̂ = argmax_μ L_{-1}(μ|C)
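The leave-one-out objective can be sketched directly; a simple grid search stands in for a proper optimizer, and the tiny corpus and reference model are illustrative:

```python
import math
from collections import Counter

def loo_log_likelihood(mu, docs, ref):
    """L_{-1}(mu|C): predict each word occurrence with the Dirichlet-
    smoothed model of its own document minus that occurrence."""
    total = 0.0
    for doc in docs:
        counts, n = Counter(doc), len(doc)
        for w, c in counts.items():
            total += c * math.log((c - 1 + mu * ref[w]) / (n - 1 + mu))
    return total

def estimate_mu(docs, ref, candidates=(1, 10, 100, 1000, 10000)):
    """Grid-search stand-in for mu-hat = argmax_mu L_{-1}(mu|C)."""
    return max(candidates, key=lambda mu: loo_log_likelihood(mu, docs, ref))
```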

Page 49:

CS 4501: Information Retrieval 49

Why would "leave-one-out" work?

20 words by author1:
  abc abc ab c d d abc cd d d abd ab ab ab abcd d e cd e
20 words by author2:
  abc abc ab c d d abe cb e f acf fb ef aff abef cdc db ge f s

Now, suppose we leave "e" out…

  p_ml("e"|author1) = 1/19 and p_smooth("e"|author1) = (1 + μ p("e"|REF)) / (19 + μ): μ doesn't have to be big
  p_ml("e"|author2) = 0/19 and p_smooth("e"|author2) = (0 + μ p("e"|REF)) / (19 + μ): μ must be big! (more smoothing)

The amount of smoothing is closely related to the underlying vocabulary size.

Page 50:

CS 4501: Information Retrieval 50

Recap: ranking docs by query likelihood

Estimate a language model θdi for each document d1, d2, …, dN, then rank the documents by the query likelihoods p(q|d1), p(q|d2), …, p(q|dN).

Justification: PRP — O(R=1|Q,D) ≈ P(Q|D,R=1)

Page 51:

CS 4501: Information Retrieval 51

Recap: retrieval as language model estimation

• Document ranking based on query likelihood:
  log p(q|d) = Σ_{i=1..n} log p(wi|d), where q = w1 w2 … wn
  – p(wi|d) is the document language model
  – Retrieval problem ⇒ estimation of p(wi|d)
  – Common approach: maximum likelihood estimation (MLE)

Page 52:

CS 4501: Information Retrieval 52

Recap: illustration of language model smoothing

[Plot: P(w|d) over words w, comparing the maximum likelihood estimate p_ML(w) = (count of w) / (count of all words) with the smoothed LM: probability is discounted from the seen words and reassigned so the unseen words get nonzero probabilities]

Page 53:

CS 4501: Information Retrieval 53

Recap: smoothing methods

• Method 1: Additive smoothing
  – Add a constant to the count of each word ("add one", Laplace smoothing):
    p(w|d) = (c(w,d) + 1) / (|d| + |V|)
    where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size
  – Problems?
    • Hint: are all words equally important?

Page 54:

CS 4501: Information Retrieval 54

Recap: refine the idea of smoothing

• Should all unseen words get equal probabilities?
• We can use a reference model to discriminate unseen words:

  p(w|d) = p_seen(w|d)       if w is seen in d    (discounted ML estimate)
         = α_d p(w|REF)      otherwise            (reference language model)

  where α_d = (1 − Σ_{w seen} p_seen(w|d)) / Σ_{w unseen} p(w|REF)

Page 55:

CS 4501: Information Retrieval 55

Recap: smoothing methods

• Method 2: Absolute discounting
  – Subtract a constant δ from the counts of each word:
    p(w|d) = ( max(c(w;d) − δ, 0) + δ |d|_u p(w|REF) ) / |d|
    where |d|_u is the number of unique words in d
  – Problems?
    • Hint: varied document length?

Page 56:

CS 4501: Information Retrieval 56

Recap: smoothing methods

• Method 3: Linear interpolation, Jelinek-Mercer
  – "Shrink" uniformly toward p(w|REF):
    p(w|d) = (1 − λ) c(w,d)/|d| + λ p(w|REF)
    (the MLE term interpolated with the reference model; λ is a parameter)
  – Problems?
    • Hint: what is missing?

Page 57:

CS 4501: Information Retrieval 57

Recap: smoothing methods

• Method 4: Dirichlet prior / Bayesian
  – Assume μ pseudo counts distributed as p(w|REF):
    p(w|d) = (c(w,d) + μ p(w|REF)) / (|d| + μ)
           = |d|/(|d|+μ) · c(w,d)/|d| + μ/(|d|+μ) · p(w|REF)
    (μ is a parameter)
  – Problems?

Page 58:

CS 4501: Information Retrieval 58

Recap: understanding smoothing

Query = "the algorithms for data mining"

p_ML(w|d1): 0.04   0.001   0.02   0.002   0.003
p_ML(w|d2): 0.02   0.001   0.01   0.003   0.004

Intuitively, d2 should have a higher score, but p(q|d1) = 4.8×10^-12 > p(q|d2) = 2.4×10^-12…

p("algorithms"|d1) = p("algorithms"|d2), p("data"|d1) < p("data"|d2), p("mining"|d1) < p("mining"|d2) — "data" and "mining" are the topical words. So we should make p("the") and p("for") less different for all docs, and smoothing helps to achieve this goal…

After smoothing with p(w|d) = 0.1 p_ML(w|d) + 0.9 p(w|REF):

P(w|REF):          0.2     0.00001   0.2     0.00001   0.00001
Smoothed p(w|d1):  0.184   0.000109  0.182   0.000209  0.000309
Smoothed p(w|d2):  0.182   0.000109  0.181   0.000309  0.000409

Now p(q|d1) = 2.35×10^-13 < p(q|d2) = 4.53×10^-13!
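The slide's numbers can be verified directly (values copied from the example above):

```python
# Values from the slide: query = "the algorithms for data mining".
p_d1 = [0.04, 0.001, 0.02, 0.002, 0.003]       # p_ML(w|d1)
p_d2 = [0.02, 0.001, 0.01, 0.003, 0.004]       # p_ML(w|d2)
p_ref = [0.2, 0.00001, 0.2, 0.00001, 0.00001]  # p(w|REF)

def likelihood(ps):
    """p(q|d) under a unigram model: product of per-word probabilities."""
    out = 1.0
    for p in ps:
        out *= p
    return out

def smooth(p_ml, p_r):
    """The slide's interpolation: 0.1 * p_ML(w|d) + 0.9 * p(w|REF)."""
    return 0.1 * p_ml + 0.9 * p_r

s1 = [smooth(p, r) for p, r in zip(p_d1, p_ref)]
s2 = [smooth(p, r) for p, r in zip(p_d2, p_ref)]

assert likelihood(p_d1) > likelihood(p_d2)  # before smoothing: d1 wins
assert likelihood(s1) < likelihood(s2)      # after smoothing: d2 wins
```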

Page 59:

CS 4501: Information Retrieval 59

Two-stage smoothing [Zhai & Lafferty 02]

P(w|d) = (1−λ) (c(w,d) + μ p(w|C)) / (|d| + μ) + λ p(w|U)

• Stage 1 — Dirichlet prior (Bayesian), with collection LM p(w|C): explain unseen words
• Stage 2 — two-component mixture, with user background model p(w|U): explain noise in the query

Page 60:

CS 4501: Information Retrieval 60

Understanding smoothing

• Plugging the general smoothing scheme

  p(w|d) = p_seen(w|d) if w is seen in d; α_d p(w|C) otherwise

  into the query likelihood retrieval formula, we obtain

  log p(q|d) = Σ_{wi∈q, c(wi,d)>0} c(wi,q) log[ p_seen(wi|d) / (α_d p(wi|C)) ] + |q| log α_d + Σ_{wi∈q} c(wi,q) log p(wi|C)

  – First term: TF weighting combined with IDF weighting (the ratio grows with c(w,d) and shrinks with p(w|C))
  – |q| log α_d: doc-length normalization (a longer doc is expected to have a smaller α_d)
  – Last term: independent of the document — ignored for ranking

• Smoothing with p(w|C) ⇒ TF-IDF + doc-length normalization

Page 61:

CS 4501: Information Retrieval 61

, ( , ) 0

, ( , ) 0, , ( , ) 0, ( , ) 0( , ) 0

, ( , ) 0 , ( , )

log ( | ) ( , ) log ( | )

( , ) log ( | ) ( , ) log ( | )

( , ) log ( | ) ( , ) log ( | ) ( , ) log ( | )

w V c w q

Seen dw V c w d w V c w q c w dc w q

Seen d dw V c w d w V c w q

p q d c w q p w d

c w q p w d c w q p w C

c w q p w d c w q p w C c w q p w C

, ( , ) 0 0, ( , ) 0

, ( , ) 0 , ( , ) 0( , ) 0

( | )( , ) log | | log ( , ) ( | )

( | )

w V c w q c w d

Seend

w V c w d w V c w qdc w q

p w dc w q q c w q p w C

p w C

Smoothing & TF-IDF weighting

( | )( | )

( | )Seen

d

p w d if w is seen in dp w d

p w C otherwise

1 ( | )

( | )

Seenw is seen

d

w is unseen

p w d

p w C

Smoothed ML estimate

Reference language model

Retrieval formula using the general smoothing scheme

Key rewriting step (where did we see it before?)

CS@UVa

Similar rewritings are very common when using probabilistic models for IR…

𝑐 (𝑤 ,𝑞 )>0

Page 62:

CS 4501: Information Retrieval 62

What you should know

• How to estimate a language model• General idea and different ways of smoothing• Effect of smoothing


Page 63:

CS 4501: Information Retrieval 63

Today’s reading

• Introduction to information retrieval
  – Chapter 12: Language models for information retrieval