
Language Models & Smoothing

Shallow Processing Techniques for NLP, Ling570

October 19, 2011

Announcements

Career exploration talk: Bill McNeill
Thursday (10/20): 2:30-3:30pm, Thomson 135 & Online (Treehouse URL)

Treehouse meeting: Friday 10/21: 11-12; thesis topic brainstorming

GP Meeting: Friday 10/21: 3:30-5pm, PCAR 291 & Online (…/clmagrad)

Roadmap

Ngram language models

Constructing language models

Generative language models

Evaluation: training and testing; perplexity

Smoothing: Laplace smoothing; Good-Turing smoothing; interpolation & backoff

Ngram Language Models

Independence assumptions moderate data needs

Approximate the probability given all prior words by assuming a finite history:
Unigram: probability of a word in isolation
Bigram: probability of a word given 1 previous word
Trigram: probability of a word given 2 previous words

N-gram approximation:

P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})

Bigram sequence:

P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k | w_{k-1})

Berkeley Restaurant Project Sentences

can you tell me about any good cantonese restaurants close by

mid priced thai food is what i’m looking for

tell me about chez panisse

can you give me a listing of the kinds of food that are available

i’m looking for a good place to eat breakfast

when is caffe venezia open during the day

Bigram Counts

Out of 9222 sentences

E.g., "I want" occurred 827 times

Bigram Probabilities

Divide bigram counts by prefix unigram counts to get probabilities:

P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
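A minimal Python sketch of this maximum-likelihood estimate (not the assignment scripts); the unigram counts below are made-up placeholders, while 827 is the "I want" count from the slide:

    from collections import Counter

    def mle_bigram_probs(bigram_counts, unigram_counts):
        # P(w2 | w1) = C(w1 w2) / C(w1): divide each bigram count by its prefix unigram count
        return {(w1, w2): c / unigram_counts[w1]
                for (w1, w2), c in bigram_counts.items()}

    # Toy counts for illustration only (not the full Berkeley Restaurant table).
    unigrams = Counter({"i": 2500, "want": 900})
    bigrams = Counter({("i", "want"): 827, ("want", "to"): 600})

    probs = mle_bigram_probs(bigrams, unigrams)
    print(probs[("i", "want")])   # 827 / 2500 ≈ 0.33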

Bigram Estimates of Sentence Probabilities

P(<s> I want english food </s>) =

P(i|<s>)*

P(want|I)*

P(english|want)*

P(food|english)*

P(</s>|food)

= .000031
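A minimal sketch of this chaining computation; the bigram probabilities below are placeholders chosen only to land near .000031 (only P(i|<s>) = .25 and P(english|want) = .0011 come from the slides):

    import math

    # Placeholder bigram probabilities; only .25 and .0011 appear on the slides.
    bigram_p = {
        ("<s>", "i"): 0.25,
        ("i", "want"): 0.33,
        ("want", "english"): 0.0011,
        ("english", "food"): 0.5,
        ("food", "</s>"): 0.68,
    }

    def bigram_sentence_prob(words, probs):
        # Multiply P(w_k | w_{k-1}) over the BOS/EOS-padded sentence (in log space).
        padded = ["<s>"] + words + ["</s>"]
        logp = sum(math.log(probs[(w1, w2)]) for w1, w2 in zip(padded, padded[1:]))
        return math.exp(logp)

    print(bigram_sentence_prob("i want english food".split(), bigram_p))  # ≈ 3.1e-05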


Kinds of Knowledge

P(english | want) = .0011
P(chinese | want) = .0065
P(to | want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P(i | <s>) = .25

World knowledge

Syntax

Discourse

What types of knowledge are captured by ngram models?

Probabilistic Language Generation

Coin-flipping models:
A sentence is generated by a randomized algorithm
The generator can be in one of several "states"
Flip coins to choose the next state
Flip other coins to decide which letter or word to output
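A minimal sketch of this coin-flipping view for a word bigram model; the distributions below are made-up placeholders, not estimated from data:

    import random

    # Made-up bigram distributions P(next | current); a real model would be estimated from counts.
    bigram_dist = {
        "<s>":  {"i": 0.6, "tell": 0.4},
        "i":    {"want": 0.7, "am": 0.3},
        "want": {"to": 0.66, "chinese": 0.34},
        "to":   {"eat": 0.28, "spend": 0.72},
    }

    def generate(max_len=10):
        # "Flip coins" to pick each next word given the previous one.
        words, prev = [], "<s>"
        for _ in range(max_len):
            dist = bigram_dist.get(prev)
            if dist is None:          # no continuation recorded: stop
                break
            choices, weights = zip(*dist.items())
            prev = random.choices(choices, weights=weights)[0]
            words.append(prev)
        return " ".join(words)

    print(generate())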


Generated Language: Effects of N

1. Zero-order approximation: XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD

2. First-order approximation: OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL

3. Second-order approximation: ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIND ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE


Word Models: Effects of N

1. First-order approximation: REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE

2. Second-order approximation: THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED

Shakespeare

The Wall Street Journal is Not Shakespeare

Evaluation


Evaluation - General

Evaluation is crucial for NLP systems

Required for most publishable results

Should be integrated early

Many factors: data, metrics, prior results, …


Evaluation Guidelines

Evaluate your system

Use standard metrics

Use (standard) training/dev/test sets

Describing experiments (intrinsic vs. extrinsic):
Clearly lay out the experimental setting
Compare to baseline and previous results
Perform error analysis
Show utility in a real application (ideally)


Data Organization

Training:
Training data: used to learn model parameters
Held-out data: used to tune additional parameters

Development (dev) set:
Used to evaluate the system during development
Avoid overfitting

Test data: used for final, blind evaluation

Typical division of data: 80/10/10
Tradeoffs
Cross-validation
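A minimal sketch of an 80/10/10 split by position, assuming the sentences can simply be partitioned in order (shuffling and cross-validation are left out):

    def split_80_10_10(sentences):
        # Return (train, dev, test) using an 80/10/10 division by position.
        n = len(sentences)
        return sentences[:int(0.8 * n)], sentences[int(0.8 * n):int(0.9 * n)], sentences[int(0.9 * n):]

    train, dev, test = split_80_10_10([f"sentence {i}" for i in range(1000)])
    print(len(train), len(dev), len(test))  # 800 100 100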


Evaluating LMs

Extrinsic evaluation (aka in vivo):
Embed alternate models in a system
See which improves the overall application (MT, IR, …)

Intrinsic evaluation:
Metric applied directly to the model, independent of the larger application
Perplexity

Why not just extrinsic?

Perplexity


Intuition:
A better model will have a tighter fit to the test data
It will yield higher probability on the test data

Formally,

PP(W) = P(w_1 w_2 … w_N)^{-1/N}

For bigrams:

PP(W) = ( ∏_{i=1}^{N} P(w_i | w_{i-1}) )^{-1/N}

Inversely related to the probability of the sequence: higher probability → lower perplexity

Can be viewed as the average branching factor of the model
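A minimal sketch of the bigram form of this formula, computed in log space for numerical stability; the per-position probabilities are made up:

    import math

    def perplexity(per_word_probs):
        # PP(W) = (prod_i P(w_i | w_{i-1}))^(-1/N) = 2^(-(1/N) * sum_i log2 P(w_i | w_{i-1}))
        n = len(per_word_probs)
        avg_log2 = sum(math.log2(p) for p in per_word_probs) / n
        return 2 ** (-avg_log2)

    # Made-up bigram probabilities for a 5-word test sequence.
    print(perplexity([0.25, 0.33, 0.0011, 0.5, 0.68]))
    # Raising any of these probabilities lowers the perplexity.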


Perplexity Example

Alphabet: 0,1,…,9; equiprobable: P(X) = 1/10

PP(W) = ((1/10)^N)^{-1/N} = 10

If the probability of 0 is higher, PP(W) will be lower


Thinking about Perplexity

Given some vocabulary V with a uniform distribution, i.e. P(w) = 1/|V|

Under a unigram LM, the perplexity is

PP(W) = ((1/|V|)^N)^{-1/N} = |V|

Perplexity is the effective branching factor of the language


Perplexity and Entropy

Given that the cross-entropy of the model on W is H(L,P) = -(1/N) log_2 P(W),

consider the perplexity equation:

PP(W) = P(W)^{-1/N} = (2^{log_2 P(W)})^{-1/N} = 2^{-(1/N) log_2 P(W)} = 2^{H(L,P)}

where H is the entropy of the language L


Entropy

Information theoretic measure

Measures the information in a grammar

Conceptually, a lower bound on the number of bits needed to encode

Entropy: H(X), where X is a random variable and p is its probability function:

H(X) = -∑_{x∈X} p(x) log_2 p(x)

E.g., 8 things: numbering them as a code => 3 bits/transmission
Alternative: a short code for high-probability items, a longer code for low-probability ones; this can reduce the average


Computing Entropy

Picking horses (Cover and Thomas)

Send a message identifying one horse out of 8

If all horses are equally likely, p(i) = 1/8:

H(X) = -∑_{i=1}^{8} (1/8) log_2(1/8) = 3 bits

Some horses more likely: 1: 1/2; 2: 1/4; 3: 1/8; 4: 1/16; 5,6,7,8: 1/64:

H(X) = -∑_{i=1}^{8} p(i) log_2 p(i) = 2 bits
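A short check of both horse distributions, using the entropy formula from the previous slide:

    import math

    def entropy(probs):
        # H(X) = -sum_x p(x) * log2 p(x); zero-probability outcomes contribute nothing
        return -sum(p * math.log2(p) for p in probs if p > 0)

    uniform = [1/8] * 8
    skewed = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
    print(entropy(uniform))  # 3.0 bits
    print(entropy(skewed))   # 2.0 bits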

Entropy of a Sequence

Basic sequence:

(1/n) H(W_1^n) = -(1/n) ∑_{W_1^n ∈ L} p(W_1^n) log p(W_1^n)

Entropy of a language: infinite lengths; assume stationary & ergodic:

H(L) = lim_{n→∞} -(1/n) ∑_{W_1^n ∈ L} p(w_1,…,w_n) log p(w_1,…,w_n)
     = lim_{n→∞} -(1/n) log p(w_1,…,w_n)


Computing P(s): s is a sentence

Let s = w1 w2 … wn

Assume a bigram model:

P(s) = P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS)
     ≈ P(BOS) * P(w1|BOS) * P(w2|w1) * … * P(wn|wn-1) * P(EOS|wn)

Out-of-vocabulary words (OOV): if an n-gram contains an OOV word,
remove the n-gram from the computation and increment oov_count

N = sent_leng + 1 - oov_count

Computing P(s): s is a sentence

Let s = w1 w2 … wn

Assume a trigram model:

P(s) = P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS)
     ≈ P(w1|BOS) * P(w2|BOS w1) * … * P(wn|wn-2 wn-1) * P(EOS|wn-1 wn)

Out-of-vocabulary words (OOV): if an n-gram contains an OOV word,
remove the n-gram from the computation and increment oov_count

N = sent_leng + 1 - oov_count
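A minimal sketch of the bigram case of this computation, following one reading of the OOV convention above; bigram_logprob (mapping (w1, w2) to log_2 P(w2|w1)) and vocab are assumed inputs from a smoothed model:

    def sentence_logprob_bigram(words, bigram_logprob, vocab):
        # Returns (log2 P(s), N) for one sentence, with N = sent_leng + 1 - oov_count.
        padded = ["<s>"] + words + ["</s>"]
        oov_count = sum(1 for w in words if w not in vocab)
        logp = 0.0
        for w1, w2 in zip(padded, padded[1:]):
            known = (w1 in vocab or w1 == "<s>") and (w2 in vocab or w2 == "</s>")
            if known:
                logp += bigram_logprob[(w1, w2)]  # assumes a smoothed model: every in-vocabulary bigram has a logprob
            # otherwise the bigram contains an OOV word and is dropped from the computation
        return logp, len(words) + 1 - oov_count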


Computing Perplexity

PP(W) = 2^{-(1/N) log_2 P(W)}

where W is a set of m sentences s1, s2, …, sm:

log P(W) = ∑_{i=1}^{m} log P(s_i)

N = word_count + sent_count - oov_count
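Continuing the per-sentence sketch from a few slides back, a corpus-level version just sums those log probabilities and token counts (sentence_logprob_bigram is the hypothetical helper sketched earlier):

    def corpus_perplexity(sentences, bigram_logprob, vocab):
        # PP(W) = 2^(-(1/N) * log2 P(W)), with N = word_count + sent_count - oov_count
        total_logp, total_n = 0.0, 0
        for words in sentences:
            logp, n = sentence_logprob_bigram(words, bigram_logprob, vocab)
            total_logp += logp
            total_n += n      # accumulates word_count + sent_count - oov_count
        return 2 ** (-total_logp / total_n)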

Perplexity Model Comparison

Compare models with different history lengths (different n)

Homework #4

Building Language Models

Step 1: Count ngrams

Step 2: Build the model - compute probabilities (MLE; smoothed: Laplace, GT)

Step 3: Compute perplexity

Steps 2 & 3 depend on model/smoothing choices

Q1: Counting N-grams

Collect real counts from the training data:

ngram_count.* training_data ngram_count_file

Output ngrams and their real counts c(w1), c(w1, w2), and c(w1, w2, w3).

Given a sentence: John called Mary
Insert BOS and EOS: <s> John called Mary </s>

Q1: Output

Format: count  key

875 a
…
200 the book
…
20 thank you very

In "chunks": unigrams, then bigrams, then trigrams

Sort in decreasing order of count within each chunk
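A minimal sketch of the counting step (not the actual ngram_count.* script): pad each sentence with <s> and </s>, tally unigrams, bigrams, and trigrams, then print each chunk sorted by decreasing count.

    from collections import Counter

    def count_ngrams(sentences):
        # Return (unigram, bigram, trigram) Counters over BOS/EOS-padded sentences.
        uni, bi, tri = Counter(), Counter(), Counter()
        for sent in sentences:
            words = ["<s>"] + sent.split() + ["</s>"]
            uni.update(words)
            bi.update(zip(words, words[1:]))
            tri.update(zip(words, words[1:], words[2:]))
        return uni, bi, tri

    uni, bi, tri = count_ngrams(["John called Mary"])
    for counter in (uni, bi, tri):                 # unigrams, then bigrams, then trigrams
        for ngram, c in counter.most_common():     # decreasing count within each chunk
            key = ngram if isinstance(ngram, str) else " ".join(ngram)
            print(c, key)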

Q2: Create Language Model

build_lm.* ngram_count_file lm_file

Store the logprob of ngrams and other parameters in the lm file

There are actually three language models: P(w3), P(w3|w2), and P(w3|w1,w2)

The output file is in a modified ARPA format (see next slide)

Lines for n-grams are sorted by n-gram counts

Modified ARPA Format

\data\

ngram 1: type = xx; token = yy

ngram 2: type = xx; token = yy

ngram 3: type = xx; token = yy

\1-grams:

count prob logprob w1

\2-grams:

count prob logprob w1 w2

\3-grams:

count prob logprob w1 w2 w3

# xx: type count; yy: token count

# prob: P(w1) for 1-grams, P(w2|w1) for 2-grams, P(w3|w1 w2) for 3-grams

# count: C(w1), C(w1 w2), C(w1 w2 w3)
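A rough sketch of emitting this header and the n-gram sections; the field layout follows the slide, but details such as the logarithm base (log10 here) and number formatting are assumptions, not the assignment specification:

    import math

    def write_modified_arpa(out, ngram_tables):
        # ngram_tables[n-1] maps an n-gram tuple to (count, prob) for that order.
        out.write("\\data\\\n")
        for n, table in enumerate(ngram_tables, start=1):
            types, tokens = len(table), sum(c for c, _ in table.values())
            out.write(f"ngram {n}: type = {types}; token = {tokens}\n")
        for n, table in enumerate(ngram_tables, start=1):
            out.write(f"\n\\{n}-grams:\n")
            # lines sorted by decreasing n-gram count, as on the previous slide
            for ngram, (count, prob) in sorted(table.items(), key=lambda kv: -kv[1][0]):
                logprob = math.log10(prob)         # log base is an assumption
                out.write(f"{count} {prob} {logprob} {' '.join(ngram)}\n")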

Q3: Calculating Perplexity

pp.* lm_file n test_file outfile

Compute perplexity for the given n-gram order, under the model:

sum = 0; count = 0
for each sentence s in test_file:
    for each word wi in s:
        if the n-gram wi-n+1 … wi exists in the model:
            sum += log_2 P(wi | wi-n+1 … wi-1)
            count += 1
total = -sum / count
pp(test_file) = 2^total

Output format

Sent #1: <s> Influential members of the House … </s>

1: log P(Influential | <s>) = -inf (unknown word)

2: log P(members | <s> Influential) = -inf (unseen ngrams)

4: log P(the | members of) = -0.673243382588536

1 sentence, 38 words, 9 OOVs

logprob=-82.8860891791949 ppl=721.341645452964

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

sent_num=50 word_num=1175 oov_num=190

logprob=-2854.78157013778 ave_logprob=-2.75824306293506 pp=573.116699237283

Q4: Compute Perplexity

Compute perplexity for different n

Recommended