
Language Models & Smoothing

Shallow Processing Techniques for NLP, Ling570

October 19, 2011

Announcements

Career exploration talk: Bill McNeill
Thursday (10/20): 2:30-3:30pm, Thomson 135 & Online (Treehouse URL)

Treehouse meeting: Friday 10/21: 11-12; thesis topic brainstorming

GP Meeting: Friday 10/21: 3:30-5pm, PCAR 291 & Online (…/clmagrad)

Roadmap

Ngram language models

Constructing language models

Generative language models

Evaluation: training and testing; perplexity

Smoothing: Laplace smoothing; Good-Turing smoothing; interpolation & backoff

Ngram Language Models

Independence assumptions moderate data needs

Approximate the probability given all prior words by assuming a finite history:
Unigram: probability of a word in isolation
Bigram: probability of a word given 1 previous word
Trigram: probability of a word given 2 previous words

N-gram approximation:

P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})

Bigram sequence:

P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k | w_{k-1})

Berkeley Restaurant Project Sentences

can you tell me about any good cantonese restaurants close by

mid priced thai food is what i’m looking for

tell me about chez panisse

can you give me a listing of the kinds of food that are available

i’m looking for a good place to eat breakfast

when is caffe venezia open during the day

Bigram Counts

Out of 9222 sentences

E.g., "I want" occurred 827 times

Bigram Probabilities

Divide bigram counts by prefix unigram counts to get probabilities:

P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
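A minimal Python sketch of this maximum-likelihood estimate (not the assignment scripts); the unigram counts below are made-up placeholders, while 827 is the "I want" count from the slide:

    from collections import Counter

    def mle_bigram_probs(bigram_counts, unigram_counts):
        # P(w2 | w1) = C(w1 w2) / C(w1): divide each bigram count by its prefix unigram count
        return {(w1, w2): c / unigram_counts[w1]
                for (w1, w2), c in bigram_counts.items()}

    # Toy counts for illustration only (not the full Berkeley Restaurant table).
    unigrams = Counter({"i": 2500, "want": 900})
    bigrams = Counter({("i", "want"): 827, ("want", "to"): 600})

    probs = mle_bigram_probs(bigrams, unigrams)
    print(probs[("i", "want")])   # 827 / 2500 ≈ 0.33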

Bigram Estimates of Sentence Probabilities

P(<s> I want english food </s>) =

P(i|<s>)*

P(want|I)*

P(english|want)*

P(food|english)*

P(</s>|food)

= .000031
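A minimal sketch of this chaining computation; the bigram probabilities below are placeholders chosen only to land near .000031 (only P(i|<s>) = .25 and P(english|want) = .0011 come from the slides):

    import math

    # Placeholder bigram probabilities; only .25 and .0011 appear on the slides.
    bigram_p = {
        ("<s>", "i"): 0.25,
        ("i", "want"): 0.33,
        ("want", "english"): 0.0011,
        ("english", "food"): 0.5,
        ("food", "</s>"): 0.68,
    }

    def bigram_sentence_prob(words, probs):
        # Multiply P(w_k | w_{k-1}) over the BOS/EOS-padded sentence (in log space).
        padded = ["<s>"] + words + ["</s>"]
        logp = sum(math.log(probs[(w1, w2)]) for w1, w2 in zip(padded, padded[1:]))
        return math.exp(logp)

    print(bigram_sentence_prob("i want english food".split(), bigram_p))  # ≈ 3.1e-05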


Kinds of Knowledge

P(english | want) = .0011
P(chinese | want) = .0065
P(to | want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P(i | <s>) = .25

World knowledge

Syntax

Discourse

What types of knowledge are captured by ngram models?

Probabilistic Language Generation

Coin-flipping models:
A sentence is generated by a randomized algorithm
The generator can be in one of several "states"
Flip coins to choose the next state
Flip other coins to decide which letter or word to output
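A minimal sketch of this coin-flipping view for a word bigram model; the distributions below are made-up placeholders, not estimated from data:

    import random

    # Made-up bigram distributions P(next | current); a real model would be estimated from counts.
    bigram_dist = {
        "<s>":  {"i": 0.6, "tell": 0.4},
        "i":    {"want": 0.7, "am": 0.3},
        "want": {"to": 0.66, "chinese": 0.34},
        "to":   {"eat": 0.28, "spend": 0.72},
    }

    def generate(max_len=10):
        # "Flip coins" to pick each next word given the previous one.
        words, prev = [], "<s>"
        for _ in range(max_len):
            dist = bigram_dist.get(prev)
            if dist is None:          # no continuation recorded: stop
                break
            choices, weights = zip(*dist.items())
            prev = random.choices(choices, weights=weights)[0]
            words.append(prev)
        return " ".join(words)

    print(generate())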


Generated Language: Effects of N

1. Zero-order approximation: XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD

2. First-order approximation: OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL

3. Second-order approximation: ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIND ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE


Word Models: Effects of N

1. First-order approximation: REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE

2. Second-order approximation: THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED

Shakespeare

The Wall Street Journal is Not Shakespeare

Evaluation


Evaluation - General

Evaluation is crucial for NLP systems

Required for most publishable results

Should be integrated early

Many factors: data, metrics, prior results, …


Evaluation Guidelines

Evaluate your system

Use standard metrics

Use (standard) training/dev/test sets

Describing experiments (intrinsic vs. extrinsic):
Clearly lay out the experimental setting
Compare to baseline and previous results
Perform error analysis
Show utility in a real application (ideally)


Data Organization

Training:
Training data: used to learn model parameters
Held-out data: used to tune additional parameters

Development (dev) set:
Used to evaluate the system during development
Avoid overfitting

Test data: used for final, blind evaluation

Typical division of data: 80/10/10
Tradeoffs
Cross-validation
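A minimal sketch of an 80/10/10 split by position, assuming the sentences can simply be partitioned in order (shuffling and cross-validation are left out):

    def split_80_10_10(sentences):
        # Return (train, dev, test) using an 80/10/10 division by position.
        n = len(sentences)
        return sentences[:int(0.8 * n)], sentences[int(0.8 * n):int(0.9 * n)], sentences[int(0.9 * n):]

    train, dev, test = split_80_10_10([f"sentence {i}" for i in range(1000)])
    print(len(train), len(dev), len(test))  # 800 100 100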


Evaluating LMs

Extrinsic evaluation (aka in vivo):
Embed alternate models in a system
See which improves the overall application (MT, IR, …)

Intrinsic evaluation:
Metric applied directly to the model, independent of the larger application
Perplexity

Why not just extrinsic?

Perplexity


Intuition:
A better model will have a tighter fit to the test data
It will yield higher probability on the test data

Formally,

PP(W) = P(w_1 w_2 … w_N)^{-1/N}

For bigrams:

PP(W) = ( ∏_{i=1}^{N} P(w_i | w_{i-1}) )^{-1/N}

Inversely related to the probability of the sequence: higher probability → lower perplexity

Can be viewed as the average branching factor of the model
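A minimal sketch of the bigram form of this formula, computed in log space for numerical stability; the per-position probabilities are made up:

    import math

    def perplexity(per_word_probs):
        # PP(W) = (prod_i P(w_i | w_{i-1}))^(-1/N) = 2^(-(1/N) * sum_i log2 P(w_i | w_{i-1}))
        n = len(per_word_probs)
        avg_log2 = sum(math.log2(p) for p in per_word_probs) / n
        return 2 ** (-avg_log2)

    # Made-up bigram probabilities for a 5-word test sequence.
    print(perplexity([0.25, 0.33, 0.0011, 0.5, 0.68]))
    # Raising any of these probabilities lowers the perplexity.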


Perplexity Example

Alphabet: 0,1,…,9; equiprobable: P(X) = 1/10

PP(W) = ((1/10)^N)^{-1/N} = 10

If the probability of 0 is higher, PP(W) will be lower


Thinking about Perplexity

Given some vocabulary V with a uniform distribution, i.e. P(w) = 1/|V|

Under a unigram LM, the perplexity is

PP(W) = ((1/|V|)^N)^{-1/N} = |V|

Perplexity is the effective branching factor of the language


Perplexity and Entropy

Given that the cross-entropy of the model on W is H(L,P) = -(1/N) log_2 P(W),

consider the perplexity equation:

PP(W) = P(W)^{-1/N} = (2^{log_2 P(W)})^{-1/N} = 2^{-(1/N) log_2 P(W)} = 2^{H(L,P)}

where H is the entropy of the language L


Entropy

Information theoretic measure

Measures the information in a grammar

Conceptually, a lower bound on the number of bits needed to encode

Entropy: H(X), where X is a random variable and p is its probability function:

H(X) = -∑_{x∈X} p(x) log_2 p(x)

E.g., 8 things: numbering them as a code => 3 bits/transmission
Alternative: a short code for high-probability items, a longer code for low-probability ones; this can reduce the average


Computing Entropy

Picking horses (Cover and Thomas)

Send a message identifying one horse out of 8

If all horses are equally likely, p(i) = 1/8:

H(X) = -∑_{i=1}^{8} (1/8) log_2(1/8) = 3 bits

Some horses more likely: 1: 1/2; 2: 1/4; 3: 1/8; 4: 1/16; 5,6,7,8: 1/64:

H(X) = -∑_{i=1}^{8} p(i) log_2 p(i) = 2 bits
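A short check of both horse distributions, using the entropy formula from the previous slide:

    import math

    def entropy(probs):
        # H(X) = -sum_x p(x) * log2 p(x); zero-probability outcomes contribute nothing
        return -sum(p * math.log2(p) for p in probs if p > 0)

    uniform = [1/8] * 8
    skewed = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
    print(entropy(uniform))  # 3.0 bits
    print(entropy(skewed))   # 2.0 bits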

Entropy of a Sequence

Basic sequence:

(1/n) H(W_1^n) = -(1/n) ∑_{W_1^n ∈ L} p(W_1^n) log p(W_1^n)

Entropy of a language: infinite lengths; assume stationary & ergodic:

H(L) = lim_{n→∞} -(1/n) ∑_{W_1^n ∈ L} p(w_1,…,w_n) log p(w_1,…,w_n)
     = lim_{n→∞} -(1/n) log p(w_1,…,w_n)


Computing P(s): s is a sentence

Let s = w1 w2 … wn

Assume a bigram model:

P(s) = P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS)
     ≈ P(BOS) * P(w1|BOS) * P(w2|w1) * … * P(wn|wn-1) * P(EOS|wn)

Out-of-vocabulary words (OOV): if an n-gram contains an OOV word,
remove the n-gram from the computation and increment oov_count

N = sent_leng + 1 - oov_count

Computing P(s): s is a sentence

Let s = w1 w2 … wn

Assume a trigram model:

P(s) = P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS)
     ≈ P(w1|BOS) * P(w2|BOS w1) * … * P(wn|wn-2 wn-1) * P(EOS|wn-1 wn)

Out-of-vocabulary words (OOV): if an n-gram contains an OOV word,
remove the n-gram from the computation and increment oov_count

N = sent_leng + 1 - oov_count
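A minimal sketch of the bigram case of this computation, following one reading of the OOV convention above; bigram_logprob (mapping (w1, w2) to log_2 P(w2|w1)) and vocab are assumed inputs from a smoothed model:

    def sentence_logprob_bigram(words, bigram_logprob, vocab):
        # Returns (log2 P(s), N) for one sentence, with N = sent_leng + 1 - oov_count.
        padded = ["<s>"] + words + ["</s>"]
        oov_count = sum(1 for w in words if w not in vocab)
        logp = 0.0
        for w1, w2 in zip(padded, padded[1:]):
            known = (w1 in vocab or w1 == "<s>") and (w2 in vocab or w2 == "</s>")
            if known:
                logp += bigram_logprob[(w1, w2)]  # assumes a smoothed model: every in-vocabulary bigram has a logprob
            # otherwise the bigram contains an OOV word and is dropped from the computation
        return logp, len(words) + 1 - oov_count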


Computing Perplexity

PP(W) = 2^{-(1/N) log_2 P(W)}

where W is a set of m sentences s1, s2, …, sm:

log P(W) = ∑_{i=1}^{m} log P(s_i)

N = word_count + sent_count - oov_count
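Continuing the per-sentence sketch from a few slides back, a corpus-level version just sums those log probabilities and token counts (sentence_logprob_bigram is the hypothetical helper sketched earlier):

    def corpus_perplexity(sentences, bigram_logprob, vocab):
        # PP(W) = 2^(-(1/N) * log2 P(W)), with N = word_count + sent_count - oov_count
        total_logp, total_n = 0.0, 0
        for words in sentences:
            logp, n = sentence_logprob_bigram(words, bigram_logprob, vocab)
            total_logp += logp
            total_n += n      # accumulates word_count + sent_count - oov_count
        return 2 ** (-total_logp / total_n)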

Perplexity Model Comparison

Compare models with different history lengths (different n)

Homework #4

Building Language Models

Step 1: Count ngrams

Step 2: Build the model - compute probabilities (MLE; smoothed: Laplace, GT)

Step 3: Compute perplexity

Steps 2 & 3 depend on model/smoothing choices

Q1: Counting N-grams

Collect real counts from the training data:

ngram_count.* training_data ngram_count_file

Output ngrams and their real counts c(w1), c(w1, w2), and c(w1, w2, w3).

Given a sentence: John called Mary
Insert BOS and EOS: <s> John called Mary </s>

Q1: Output

Format: count  key

875 a
…
200 the book
…
20 thank you very

In "chunks": unigrams, then bigrams, then trigrams

Sort in decreasing order of count within each chunk
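A minimal sketch of the counting step (not the actual ngram_count.* script): pad each sentence with <s> and </s>, tally unigrams, bigrams, and trigrams, then print each chunk sorted by decreasing count.

    from collections import Counter

    def count_ngrams(sentences):
        # Return (unigram, bigram, trigram) Counters over BOS/EOS-padded sentences.
        uni, bi, tri = Counter(), Counter(), Counter()
        for sent in sentences:
            words = ["<s>"] + sent.split() + ["</s>"]
            uni.update(words)
            bi.update(zip(words, words[1:]))
            tri.update(zip(words, words[1:], words[2:]))
        return uni, bi, tri

    uni, bi, tri = count_ngrams(["John called Mary"])
    for counter in (uni, bi, tri):                 # unigrams, then bigrams, then trigrams
        for ngram, c in counter.most_common():     # decreasing count within each chunk
            key = ngram if isinstance(ngram, str) else " ".join(ngram)
            print(c, key)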

Q2: Create Language Model

build_lm.* ngram_count_file lm_file

Store the logprob of ngrams and other parameters in the lm file

There are actually three language models: P(w3), P(w3|w2), and P(w3|w1,w2)

The output file is in a modified ARPA format (see next slide)

Lines for n-grams are sorted by n-gram counts

Modified ARPA Format

\data\

ngram 1: type = xx; token = yy

ngram 2: type = xx; token = yy

ngram 3: type = xx; token = yy

\1-grams:

count prob logprob w1

\2-grams:

count prob logprob w1 w2

\3-grams:

count prob logprob w1 w2 w3

# xx: type count; yy: token count

# prob: P(w1) for 1-grams, P(w2|w1) for 2-grams, P(w3|w1 w2) for 3-grams

# count: C(w1), C(w1 w2), C(w1 w2 w3)
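A rough sketch of emitting this header and the n-gram sections; the field layout follows the slide, but details such as the logarithm base (log10 here) and number formatting are assumptions, not the assignment specification:

    import math

    def write_modified_arpa(out, ngram_tables):
        # ngram_tables[n-1] maps an n-gram tuple to (count, prob) for that order.
        out.write("\\data\\\n")
        for n, table in enumerate(ngram_tables, start=1):
            types, tokens = len(table), sum(c for c, _ in table.values())
            out.write(f"ngram {n}: type = {types}; token = {tokens}\n")
        for n, table in enumerate(ngram_tables, start=1):
            out.write(f"\n\\{n}-grams:\n")
            # lines sorted by decreasing n-gram count, as on the previous slide
            for ngram, (count, prob) in sorted(table.items(), key=lambda kv: -kv[1][0]):
                logprob = math.log10(prob)         # log base is an assumption
                out.write(f"{count} {prob} {logprob} {' '.join(ngram)}\n")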

Q3: Calculating Perplexity

pp.* lm_file n test_file outfile

Compute perplexity for the given n-gram order, under the model:

sum = 0; count = 0
for each sentence s in test_file:
    for each word wi in s:
        if the n-gram wi-n+1 … wi exists in the model:
            sum += log_2 P(wi | wi-n+1 … wi-1)
            count += 1
total = -sum / count
pp(test_file) = 2^total

Output format

Sent #1: <s> Influential members of the House … </s>

1: log P(Influential | <s>) = -inf (unknown word)

2: log P(members | <s> Influential) = -inf (unseen ngrams)

4: log P(the | members of) = -0.673243382588536

1 sentence, 38 words, 9 OOVs

logprob=-82.8860891791949 ppl=721.341645452964

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

sent_num=50 word_num=1175 oov_num=190

logprob=-2854.78157013778 ave_logprob=-2.75824306293506 pp=573.116699237283

Q4: Compute Perplexity

Compute perplexity for different n

Recommended