
Page 1:

Statistical Methods for NLP

Diana Trandabăț

2019-2020

Page 2:

Noisy Channel Model

• Sent message in the source language (f): "Ce foame am"

• The channel delivers a "broken" message in the target language (e); candidate decodings include: "What hunger have I", "Hungry am I so", "I am so hungry", "Have I that hunger", ...

• Translation model P(f|e): learned by statistical analysis of bilingual text

• Language model P(e): learned by statistical analysis of monolingual text

• Decoded message: "I am so hungry"
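The two models are combined at decoding time. As a minimal sketch of the standard noisy-channel decision rule (not written out on the slide), the decoder searches for the target-language sentence that maximizes the product of the two scores:

    ê = argmax_e P(e | f) = argmax_e P(f | e) · P(e)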

Page 3:

Uses of Language Models

• Speech recognition

– “I ate a cherry” is a more likely sentence than “Eye eight uh Jerry”

• OCR & Handwriting recognition – More probable sentences are more likely correct readings.

• Machine translation – More likely sentences are probably better translations.

• Generation – More likely sentences are probably better NL generations.

• Context sensitive spelling correction – “Their are problems wit this sentence.”

Page 4:

Completion Prediction

• A language model also supports predicting the completion of a sentence.

– Please turn off your cell _____

– Your program does not ______

• Predictive text input systems can guess what you are typing and give choices on how to complete it.

Pages 5-20:

Letter-based Language Models

• Shannon's Game

• Guess the next letter, revealed one character at a time across the slides:

W … Wh … Wha … What … What d … What do … What do you think the next letter is?

• Then guess the next word, revealed one word at a time:

What … What do … What do you … What do you think … What do you think the … What do you think the next … What do you think the next word is?

Page 21:

Real Word Spelling Errors

• They are leaving in about fifteen minuets to go to her house.

• The study was conducted mainly be John Black.

• Hopefully, all with continue smoothly in my absence.

• Can they lave him my messages?

• I need to notified the bank of….

• He is trying to fine out.

Page 22:

Corpora

• Corpora are online collections of text and speech

– Brown Corpus

– Wall Street Journal

– AP newswire

– Hansards

– TIMIT

– DARPA/NIST text/speech corpora (Call Home, Call Friend, ATIS, Switchboard, Broadcast News, Broadcast Conversation, TDT, Communicator)

– TRAINS, Boston Radio News Corpus

Page 23:

Counting Words in Corpora

• What is a word?

– e.g., are cat and cats the same word?

– September and Sept?

– zero and oh?

– Are punctuation marks and symbols such as _ * ? ) . , words?

– How many words are there in don’t ? Gonna ?

– In Japanese and Chinese text, how do we identify a word?

Page 24:

Terminology

• Sentence: unit of written language

• Utterance: unit of spoken language

• Word Form: the inflected form as it actually appears in the corpus

• Lemma: an abstract form shared by word forms having the same stem, part of speech, and word sense

– stands for the class of words with the same stem

• Types: number of distinct words in a corpus (vocabulary size)

• Tokens: total number of words

Page 25:

Word-based Language Models

• A model that enables one to compute the probability, or likelihood, of a sentence S, P(S).

• Simple: Every word follows every other word w/ equal probability (0-gram)

– Assume |V| is the size of the vocabulary V

– Likelihood of sentence S of length n is 1/|V| × 1/|V| × … × 1/|V| = (1/|V|)^n

– If English has 100,000 words, probability of each next word is 1/100000 = .00001
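To make concrete how small these uniform-model probabilities get, here is a minimal illustrative calculation (the 100,000-word vocabulary is the figure from the slide; the sentence length is an arbitrary example):

    # Uniform ("0-gram") model: every next word has probability 1/|V|
    V = 100_000           # vocabulary size assumed on the slide
    n = 6                 # example sentence length
    likelihood = (1 / V) ** n
    print(likelihood)     # 1e-30: every 6-word sentence is equally (im)probable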

Page 26:

Word Prediction: Simple vs. Smart

• Smarter: probability of each next word is related to word frequency (unigram)

– Likelihood of sentence S = P(w1) × P(w2) × … × P(wn)

– Assumes probability of each word is independent of probabilities of other words.

• Even smarter: Look at probability given previous words (N-gram)

– Likelihood of sentence S = P(w1) × P(w2|w1) × … × P(wn|wn-1)

– Assumes the probability of each word depends on the preceding word(s).

Page 27:

Chain Rule

• Conditional probability

– P(w1,w2) = P(w1) · P(w2|w1)

• The Chain Rule generalizes to multiple events

– P(w1, …, wn) = P(w1) P(w2|w1) P(w3|w1,w2) … P(wn|w1…wn-1)

• Examples:

– P(the dog) = P(the) P(dog | the)

– P(the dog barks) = P(the) P(dog | the) P(barks | the dog)

Page 28:

Relative Frequencies and Conditional Probabilities

• Relative word frequencies are better than equal probabilities for all words

– In a corpus with 10K word types, each word would have P(w) = 1/10K

– Does not match our intuitions that some words are more likely to occur (e.g. the) than others

• Conditional probability - more useful than individual relative word frequencies

– dog may be relatively rare in a corpus

– But if we see barking, P(dog|barking) may be very large

Page 29:

N-Gram Models

• Estimate the probability of each word given prior context.

– P(phone | Please turn off your cell)

• The number of parameters required grows exponentially with the number of words of prior context.

• An N-gram model uses only N-1 words of prior context.

– Unigram: P(phone)

– Bigram: P(phone | cell)

– Trigram: P(phone | your cell)

• The Markov assumption is the presumption that the future behavior of a dynamical system depends only on its recent history. In particular, in a kth-order Markov model, the next state depends only on the k most recent states; therefore an N-gram model is an (N-1)th-order Markov model.
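To see why the parameter count explodes with context length, a small illustrative calculation (the 20,000-word vocabulary is an assumption, not a figure from the slide): an N-gram model over a vocabulary V has up to |V|^N conditional probabilities to estimate.

    # Upper bound on the number of N-gram parameters for an assumed |V| = 20,000
    V = 20_000
    for N in (1, 2, 3):
        print(f"{N}-gram: up to {V ** N:,} parameters")
    # 1-gram: 20,000   2-gram: 400,000,000   3-gram: 8,000,000,000,000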

Page 30:

N-Gram Model Formulas

• Word sequences: w1..n = w1 w2 … wn

• Chain rule of probability:

P(w1..n) = P(w1) P(w2|w1) P(w3|w1..2) … P(wn|w1..n-1) = ∏ k=1..n P(wk | w1..k-1)

• Bigram approximation:

P(w1..n) ≈ ∏ k=1..n P(wk | wk-1)

• N-gram approximation:

P(w1..n) ≈ ∏ k=1..n P(wk | wk-N+1..k-1)

Page 31:

Maximum Likelihood Estimate (MLE)

• N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences.

• To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every sentence and treat these as additional words.

Bigram: P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

N-gram: P(wn | wn-N+1..n-1) = C(wn-N+1..n-1 wn) / C(wn-N+1..n-1)
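A minimal sketch of this counting-and-normalizing step in Python (the toy corpus and the function name are illustrative, not from the slides):

    from collections import Counter

    def mle_bigram_probs(sentences):
        """Estimate bigram probabilities P(wn | wn-1) by relative frequency."""
        unigram_counts = Counter()
        bigram_counts = Counter()
        for sent in sentences:
            words = ["<s>"] + sent.split() + ["</s>"]   # sentence boundary symbols
            unigram_counts.update(words[:-1])            # context counts
            bigram_counts.update(zip(words, words[1:]))
        return {(w1, w2): c / unigram_counts[w1]
                for (w1, w2), c in bigram_counts.items()}

    probs = mle_bigram_probs(["I am so hungry", "I am here"])
    print(probs[("I", "am")])   # 1.0 in this toy corpus: "am" always follows "I"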

Page 32:

Simple N-Grams

• An N-gram model uses the previous N-1 words to predict the next one:

– P(wn | wn-N+1 wn-N+2 … wn-1)

• unigrams: P(dog)

• bigrams: P(dog | big)

• trigrams: P(dog | the big)

• quadrigrams: P(dog | chasing the big)

Page 33:

Using N-Grams

• Recall that

– N-gram: P(wn | w1..n-1) ≈ P(wn | wn-N+1..n-1)

– Bigram: P(w1..n) ≈ ∏ k=1..n P(wk | wk-1)

• For a bigram grammar

– P(sentence) can be approximated by multiplying all the bigram probabilities in the sequence

• Example: P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese)

Page 34:

A Bigram Grammar Fragment

Eat on .16 Eat Thai .03

Eat some .06 Eat breakfast .03

Eat lunch .06 Eat in .02

Eat dinner .05 Eat Chinese .02

Eat at .04 Eat Mexican .02

Eat a .04 Eat tomorrow .01

Eat Indian .04 Eat dessert .007

Eat today .03 Eat British .001

Page 35:

Additional Grammar

<start> I .25 Want some .04

<start> I’d .06 Want Thai .01

<start> Tell .04 To eat .26

<start> I’m .02 To have .14

I want .32 To spend .09

I would .29 To be .02

I don’t .08 British food .60

I have .04 British restaurant .15

Want to .65 British cuisine .01

Want a .05 British lunch .01

Page 36:

Computing Sentence Probability

• P(I want to eat British food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British) = .25 × .32 × .65 × .26 × .001 × .60 ≈ .0000081

• vs.

• P(I want to eat Chinese food) = .00015

• Probabilities seem to capture "syntactic" facts and "world knowledge" – eat is often followed by an NP

– British food is not too popular

• N-gram models can be trained by counting and normalization
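The calculation above is easy to reproduce; here is a minimal sketch that reuses the bigram probabilities from the two fragment tables (the dictionary layout and function name are illustrative):

    # Bigram probabilities taken from the grammar fragment slides above
    bigram_p = {
        ("<start>", "I"): .25, ("I", "want"): .32, ("want", "to"): .65,
        ("to", "eat"): .26, ("eat", "British"): .001, ("British", "food"): .60,
    }

    def sentence_prob(words, bigram_p):
        """Multiply the bigram probabilities along the sentence."""
        p = 1.0
        for w1, w2 in zip(["<start>"] + words, words):
            p *= bigram_p[(w1, w2)]
        return p

    print(sentence_prob("I want to eat British food".split(), bigram_p))
    # ≈ 8.1e-06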

Page 37:

N-grams Issues

• Sparse data – Not all N-grams are found in training data, need smoothing

• Change of domain – train on the Wall Street Journal, then try to model Shakespeare – it won't work

• N-grams more reliable than (N-1)-grams – But even more sparse

• Generating Shakespeare sentences with random unigrams...

– Every enter now severally so, let

• With bigrams...

– What means, sir. I confess she? then all sorts, he is trim, captain.

• Trigrams

– Sweet prince, Falstaff shall die.

Page 38:

N-grams Issues

• Determining reliable sentence probability estimates

– requires smoothing (to avoid zero counts)

– apply back-off strategies: if N-gram counts are not available, back off to (N-1)-grams

• P("And nothing but the truth") ≈ 0.001

• P("And nuts sing on the roof") ≈ 0

Page 39:

Zipf’s Law

• Rank (r): The numerical position of a word in a list sorted by decreasing frequency (f ).

• Zipf (1949) “discovered” that:

• If probability of word of rank r is pr and N is the total number of word occurrences:

f · r = k (for constant k)

pr = f / N = A / r, for a corpus-independent constant A ≈ 0.1

Page 40:

Predicting Occurrence Frequencies

• By Zipf, a word appearing n times has rank rn=AN/n

• If several words may occur n times, assume rank rn applies to the last of these.

• Therefore, rn words occur n or more times and rn+1 words occur n+1 or more times.

• So, the number of words appearing exactly n times is:

In = rn − rn+1 = AN/n − AN/(n+1) = AN / (n(n+1))

• The fraction of words with frequency n is:

In / D = 1 / (n(n+1)), where D = AN is the total number of distinct words (the rank of the rarest words, which occur once)

• The fraction of words appearing only once is therefore 1/(1·2) = 1/2.
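As a sanity check on this derivation (a small illustrative script, not from the slides), the predicted fractions 1/(n(n+1)) sum to 1 over all frequencies, and the first term is 1/2:

    # Zipf-predicted fraction of word types that occur exactly n times
    fractions = [1 / (n * (n + 1)) for n in range(1, 100_000)]
    print(fractions[0])      # 0.5 -> half of all word types occur only once
    print(sum(fractions))    # ~0.99999 -> the fractions sum to 1 (telescoping series)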

Page 41:

Zipf’s Law Impact on Language Analysis

• Good News: Stopwords account for a large fraction of the tokens in a text, so eliminating them greatly reduces its size.

• Bad News: For most words, gathering sufficient data for meaningful statistical analysis (e.g. for correlation analysis for query expansion) is difficult since they are extremely rare.

Page 42:

Train and Test Corpora

• A language model must be trained on a large corpus of text to estimate good parameter values.

• Model can be evaluated based on its ability to predict a high probability for a disjoint (held-out) test corpus (testing on the training corpus would give an optimistically biased estimate).

• Ideally, the training (and test) corpus should be representative of the actual application data.

• May need to adapt a general model to a small amount of new (in-domain) data by adding the small, highly weighted in-domain corpus to the original training data.

Page 43:

Evaluation of Language Models

• Ideally, evaluate use of model in end application (extrinsic, in vivo)

– Realistic

– Expensive

• Evaluate on ability to model test corpus (intrinsic).

– Less realistic

– Cheaper

• Verify at least once that intrinsic evaluation correlates with an extrinsic one.

Page 44:

Evaluation and Data Sparsity Questions

• Perplexity and entropy: how do you estimate how well your language model fits a corpus once you’re done?

• Smoothing and backoff: how do you handle unseen n-grams?

Page 45:

Perplexity and Entropy

• Information theoretic metrics

– Useful in measuring how well a grammar or language model (LM) models a natural language or a corpus

• Entropy: With 2 LMs and a corpus, which LM is the better match for the corpus? How much information is in a grammar or LM about what the next word will be? More is better!

– For a random variable X ranging over e.g. bigrams and a probability function p(x), the entropy of X is the expected negative log probability:

H(X) = − Σ x∈X p(x) log2 p(x)

Page 46:

• Perplexity

– At each choice point in a grammar, what is the average number of choices that can be made, weighted by their probabilities of occurrence?

• I.e., the weighted average branching factor

– How much probability does a grammar or language model (LM) assign to the sentences of a corpus, compared to another LM?

– The more information, the lower the perplexity.

PP(W) = 2^H(W)

Page 47:

Perplexity • Measure of how well a model “fits” the test data.

• Uses the probability that the model assigns to the test corpus.

• Normalizes for the number of words in the test corpus and takes the inverse.

PP(W) = P(w1 w2 … wN)^(−1/N)

• Measures the weighted average branching factor in predicting the next word (lower is better).
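A minimal sketch of this computation for a bigram model, reusing the fragment probabilities from the earlier slides (the function name and the choice of test sentence are illustrative):

    import math

    def perplexity(words, bigram_p):
        """Perplexity of a word sequence under a bigram model: P(w1..wN)^(-1/N)."""
        log_prob = sum(math.log2(bigram_p[(w1, w2)])
                       for w1, w2 in zip(["<start>"] + words, words))
        return 2 ** (-log_prob / len(words))

    bigram_p = {("<start>", "I"): .25, ("I", "want"): .32, ("want", "to"): .65,
                ("to", "eat"): .26, ("eat", "British"): .001, ("British", "food"): .60}
    print(perplexity("I want to eat British food".split(), bigram_p))   # ≈ 7.1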

Page 48:

Sample Perplexity Evaluation

• Models trained on 38 million words from the Wall Street Journal (WSJ) using a 19,979 word vocabulary.

• Evaluate on a disjoint set of 1.5 million WSJ words.

Unigram Bigram Trigram

Perplexity 962 170 109

Page 49:

Smoothing

• Since there are a combinatorial number of possible word sequences, many rare (but not impossible) combinations never occur in training, so zero is incorrectly assigned to many parameters (a.k.a. sparse data).

• If a new combination occurs during testing, it is given a probability of zero and the entire sequence gets a probability of zero (i.e. infinite perplexity).

• In practice, parameters are smoothed (a.k.a. regularized) to reassign some probability mass to unseen events.

– Adding probability mass to unseen events requires removing it from seen ones (discounting) in order to maintain a joint distribution that sums to 1.

Page 50:

Unknown Words

• How to handle words in the test corpus that did not occur in the training data, i.e. out of vocabulary (OOV) words?

• Train a model that includes an explicit symbol for an unknown word (<UNK>).

– Choose a vocabulary in advance and replace all other words in the training corpus with <UNK>.

– Alternatively, replace the first occurrence of each word in the training data with <UNK>.

Page 51:

Laplace (Add-One) Smoothing

• "Hallucinate" additional training data in which each possible N-gram occurs exactly once, and adjust estimates accordingly.

Bigram: P(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)

N-gram: P(wn | wn-N+1..n-1) = (C(wn-N+1..n-1 wn) + 1) / (C(wn-N+1..n-1) + V)

where V is the total number of possible (N-1)-grams (i.e. the vocabulary size for a bigram model).

• Tends to reassign too much mass to unseen events, so it can be adjusted to add 0 < δ < 1 instead (normalized by δV instead of V).
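A minimal sketch of add-one smoothing on top of bigram and unigram counts like those computed earlier (the counts, function name and vocabulary size are illustrative):

    from collections import Counter

    def laplace_bigram_prob(w1, w2, bigram_counts, unigram_counts, vocab_size):
        """Add-one smoothed estimate of P(w2 | w1)."""
        return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + vocab_size)

    # Toy counts: an unseen bigram now gets a small nonzero probability
    unigram_counts = Counter({"I": 2, "am": 2})
    bigram_counts = Counter({("I", "am"): 2})
    V = 10  # assumed vocabulary size
    print(laplace_bigram_prob("I", "am", bigram_counts, unigram_counts, V))     # (2+1)/(2+10) = 0.25
    print(laplace_bigram_prob("I", "think", bigram_counts, unigram_counts, V))  # (0+1)/(2+10) ≈ 0.083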

Page 52:

Advanced Smoothing

• Many advanced techniques have been developed to improve smoothing for language models.

– Good-Turing

– Interpolation

– Backoff

– Kneser-Ney

– Class-based (cluster) N-grams

Page 53:

Model Combination

• As N increases, the power (expressiveness) of an N-gram model increases, but the ability to estimate accurate parameters from sparse data decreases (i.e. the smoothing problem gets worse).

• A general approach is to combine the results of multiple N-gram models of increasing complexity (i.e. increasing N).

Page 54:

Interpolation

• Linearly combine estimates of N-gram models of increasing order.

• Learn proper values for the λi by training to (approximately) maximize the likelihood of an independent development (a.k.a. tuning) corpus.

Interpolated Trigram Model:

P̂(wn | wn-2, wn-1) = λ1 P(wn | wn-2, wn-1) + λ2 P(wn | wn-1) + λ3 P(wn)

where Σi λi = 1
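A minimal sketch of the interpolated estimate (the probability tables and the weights are illustrative; the λ's would normally be tuned on a held-out corpus):

    def interpolated_prob(w1, w2, w3, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
        """Linear interpolation of trigram, bigram and unigram estimates (weights sum to 1)."""
        l_tri, l_bi, l_uni = lambdas
        return (l_tri * p_tri.get((w1, w2, w3), 0.0)
                + l_bi * p_bi.get((w2, w3), 0.0)
                + l_uni * p_uni.get(w3, 0.0))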

Page 55:

Backoff

• Only use a lower-order model when data for the higher-order model is unavailable (i.e. its count is zero).

• Recursively back off to weaker models until data is available.

Pkatz(wn | wn-N+1..n-1) = P*(wn | wn-N+1..n-1), if C(wn-N+1..n) ≥ 1

Pkatz(wn | wn-N+1..n-1) = α(wn-N+1..n-1) · Pkatz(wn | wn-N+2..n-1), otherwise

where P* is a discounted probability estimate that reserves mass for unseen events and the α's are back-off weights.
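A much-simplified sketch of the recursive back-off idea (it omits the discounted P* and the α weights of true Katz backoff, so it only illustrates the control flow and is not a proper probability model; the data structure is assumed):

    def backoff_prob(context, word, counts):
        """Shorten the context until (context, word) has been seen, then use relative frequency.
        `counts` maps a context tuple to a dict of next-word counts (illustrative structure)."""
        while context and counts.get(context, {}).get(word, 0) == 0:
            context = context[1:]                      # back off: drop the earliest context word
        ctx_counts = counts.get(context, {})
        total = sum(ctx_counts.values())
        return ctx_counts.get(word, 0) / total if total else 0.0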

Page 56:

A Problem for N-Grams: Long Distance Dependencies

• Many times local context does not provide the most useful predictive clues, which instead are provided by long-distance dependencies.

– Syntactic dependencies

• "The man next to the large oak tree near the grocery store on the corner is tall."

• "The men next to the large oak tree near the grocery store on the corner are tall."

– Semantic dependencies

• "The bird next to the large oak tree near the grocery store on the corner flies rapidly."

• "The man next to the large oak tree near the grocery store on the corner talks rapidly."

• More complex models of language are needed to handle such dependencies.

Page 57:

Summary

• Language models assign a probability that a sentence is a legal string in a language.

• They are useful as a component of many NLP systems, such as ASR, OCR, and MT.

• Simple N-gram models are easy to train on unsupervised corpora and can provide useful estimates of sentence likelihood.

• MLE gives inaccurate parameters for models trained on sparse data.

• Smoothing techniques adjust parameter estimates to account for unseen (but not impossible) events.

Page 58:

Google NGrams

Page 59:

Great!

Let’s go to practicalities!