TOPIC 4: LANGUAGE MODELING
NATURAL LANGUAGE PROCESSING (NLP)
CS-724
Wondwossen Mulugeta (PhD), email: [email protected]
Topics
Topic 4: Language Modeling
Subtopics:
• Introduction to Language Modeling
• Applications
• The N-gram model
• Smoothing Techniques
• Model Evaluation
What is a Language Model
Language models assign a probability that a sentence is a legal string in a
language.
Having a way to estimate the relative likelihood of different phrases is useful
in many natural language processing applications.
• We can view a finite state automaton as a deterministic language model.
It can generate “I wish I wish I wish I wish . . .” but CANNOT generate “wish I wish” or “I wish I”.
Our basic model: each document was generated by a different automaton like this except that these automata are probabilistic.
Language Models
Formal grammars (e.g. regular, context free) give a hard “binary” model of the legal sentences in a language.
For NLP, a probabilistic model of a language that gives a probability that a string is a member of a language is more useful.
To specify a correct probability distribution, the probability of all sentences in a language must sum to 1.
Uses of Language Models
Speech recognition
“I ate a cherry” is a more likely sentence than “Eye eight uh Jerry”
OCR & Handwriting recognition
More probable sentences are more likely correct readings.
Machine translation
More likely sentences are probably better translations.
Generation
More likely sentences are probably better NL generations.
Context sensitive spelling correction
“Their are problems wit this sentence.”
Uses of Language Models
A language model also supports predicting the
completion of a sentence.
Please turn off your cell _____
Your program does not ______
Predictive text input systems can guess what you
are typing and give choices on how to complete it.
N-Gram Models
Estimate probability of each word given prior context.
P(phone | Please turn off your cell)
Number of parameters required grows exponentially with the number of words of prior context.
An N-gram model uses only N−1 words of prior context.
Unigram: P(phone)
Bigram: P(phone | cell)
Trigram: P(phone | your cell)
N-Gram Models
The Markov assumption is the presumption that the future behavior of a dynamical system depends only on its recent history. In particular, in a kth-order Markov model, the next state depends only on the k most recent states; therefore an N-gram model is an (N−1)th-order Markov model.
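A minimal sketch, assuming whitespace-tokenized input, of how N-grams (and hence their N−1 words of context) are extracted from a sentence; the function name and example sentence are illustrative, not from the slides:

```python
# Sketch: extract unigrams, bigrams and trigrams from a token list.
# Assumes simple whitespace tokenization; names are illustrative only.

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "please turn off your cell phone".split()

print(ngrams(tokens, 1))  # unigrams: ('please',), ('turn',), ...
print(ngrams(tokens, 2))  # bigrams:  ('please', 'turn'), ('turn', 'off'), ...
print(ngrams(tokens, 3))  # trigrams: ('please', 'turn', 'off'), ...
```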
A probabilistic language model
This is a one-state probabilistic finite-state automaton, also called a unigram language model, together with the state emission distribution for its single state q1.
STOP is not a word, but a special symbol indicating that the automaton stops.
Eg: “frog said that toad likes frog” STOP
P(string) = 0.01 × 0.03 × 0.04 × 0.01 × 0.02 × 0.01 × 0.2
= 0.0000000000048 ≈ 4.8 × 10⁻¹²
Probabilistic Language Modeling
Goal: compute the probability of a sentence or sequence of words:
P(W) = P(w1,w2,w3,w4,w5…wn)
Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1)
This is called a Language Model.
How to compute P(W):
The Chain Rule
How to compute this joint probability:
P(its, water, is, so, transparent, that)
Intuition: let’s rely on the Chain Rule of Probability
Recall the definition of conditional probabilities
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
P(this is a boy) = P(this) × P(is | this) × P(a | this is) × P(boy | this is a)
The Chain Rule applied to compute joint
probability of words in sentence
P(“its water is so transparent”) =
P(its) × P(water | its) × P( is| its water)
× P(so |its water is) × P(transparent | its water is so)
P(w1 w2 … wn) = ∏_i P(wi | w1 w2 … wi−1)
How to estimate these probabilities
Could we just count and divide?
Too many possible sentences!
We’ll never see enough data for estimating these
P(the | its water is so transparent that) =
Count(its water is so transparent that the) / Count(its water is so transparent that)
Markov Assumption
Simplifying assumption:
Or maybe
P(the | its water is so transparent that) ≈ P(the | that)
P(the | its water is so transparent that) ≈ P(the | transparent that)
Andrei Markov
Markov Assumption
In other words, we approximate each
component in the product
P(w1 w2 … wn) ≈ ∏_i P(wi | wi−k … wi−1)
P(wi | w1 w2 … wi−1) ≈ P(wi | wi−k … wi−1)
N-Gram Model Formulas
Word sequences: w1^n = w1 … wn
Chain rule of probability:
P(w1^n) = P(w1) P(w2 | w1) P(w3 | w1^2) … P(wn | w1^(n−1)) = ∏_{k=1}^{n} P(wk | w1^(k−1))
Bigram approximation:
P(w1^n) ≈ ∏_{k=1}^{n} P(wk | wk−1)
N-gram approximation:
P(w1^n) ≈ ∏_{k=1}^{n} P(wk | w(k−N+1)^(k−1))
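As a concrete illustration of the bigram approximation above, the following sketch multiplies P(wk | wk−1) over a padded sentence; the probability table is a toy assumption, not data from the slides:

```python
# Sketch: score a sentence with the bigram approximation
# P(w_1..w_n) ≈ ∏_k P(w_k | w_{k-1}).  The probabilities below are
# made-up toy values used only to show the mechanics.

toy_bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "food"): 0.10,
    ("food", "</s>"): 0.40,
}

def bigram_sentence_prob(words, probs):
    """Multiply conditional bigram probabilities over the padded sentence."""
    padded = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):
        p *= probs.get((prev, cur), 0.0)  # unseen bigrams get zero (see Smoothing)
    return p

print(bigram_sentence_prob(["i", "want", "food"], toy_bigram_prob))
# 0.25 * 0.33 * 0.10 * 0.40 = 0.0033
```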
Estimating Probabilities
N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences.
To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every sentence and treat these as additional words.
Bigram: P(wn | wn−1) = C(wn−1 wn) / C(wn−1)
N-gram: P(wn | w(n−N+1)^(n−1)) = C(w(n−N+1)^(n−1) wn) / C(w(n−N+1)^(n−1))
Generative Model & MLE
An N-gram model can be seen as a probabilistic
automata for generating sentences.
Relative frequency estimates can be proven to be
maximum likelihood estimates (MLE) since they
maximize the probability that the model M will
generate the training corpus T.
Initialize the sentence with N−1 <s> symbols.
Until </s> is generated do:
Stochastically pick the next word based on the conditional probability of each word given the previous N−1 words.
MLE: θ̂ = argmax_θ P(T | M(θ))
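A minimal sketch of the generation loop above for a bigram model (N = 2); the toy conditional distributions are assumptions used only for illustration:

```python
import random

# Sketch: generate a sentence from a bigram model by repeatedly sampling
# the next word given the previous word, until </s> is produced.
# The toy conditional distributions below are illustrative assumptions.

toy_model = {
    "<s>":  {"i": 0.6, "sam": 0.4},
    "i":    {"am": 0.7, "do": 0.3},
    "am":   {"sam": 0.5, "</s>": 0.5},
    "do":   {"</s>": 1.0},
    "sam":  {"</s>": 1.0},
}

def generate(model, max_len=20):
    sentence = ["<s>"]                      # N-1 = 1 start symbol for a bigram model
    while sentence[-1] != "</s>" and len(sentence) < max_len:
        dist = model[sentence[-1]]          # P(next word | previous word)
        words = list(dist.keys())
        weights = list(dist.values())
        sentence.append(random.choices(words, weights=weights)[0])
    return " ".join(sentence)

print(generate(toy_model))
```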
Train and Test Corpora
A language model must be trained on a large corpus of text to estimate good parameter values.
Model can be evaluated based on its ability to predict a high probability for a disjoint test corpus (testing on the training corpus would give an optimistically biased estimate).
Ideally, the training (and test) corpus should be representative of the actual application data.
Unknown Words
How to handle words in the test corpus that did not occur in the training data, i.e. out of vocabulary (OOV) words?
Train a model that includes an explicit symbol for an unknown word (<UNK>).
Choose a vocabulary in advance and replace other words in the training corpus with <UNK>.
Replace the first occurrence of each word in the training data with <UNK>.
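A hedged sketch of the second strategy (choose a vocabulary in advance and map everything else to <UNK>); the frequency cutoff and toy corpus are illustrative assumptions:

```python
from collections import Counter

# Sketch: build a closed vocabulary from training tokens and map
# out-of-vocabulary words to <UNK>.  The min_count threshold is an
# illustrative assumption.

def build_vocab(tokens, min_count=2):
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count}

def replace_oov(tokens, vocab):
    return [w if w in vocab else "<UNK>" for w in tokens]

train = "the cat sat on the mat the cat slept".split()
vocab = build_vocab(train)                       # {'the', 'cat'}
test  = "the dog sat".split()
print(replace_oov(test, vocab))                  # ['the', '<UNK>', '<UNK>']
```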
Evaluation of Language Models
Ideally, evaluate the use of the model in an end user's application
(extrinsic)
Realistic
Expensive
Evaluate on ability to model test corpus (intrinsic).
Less realistic
Cheaper
Verify at least once that intrinsic evaluation correlates
with an extrinsic one.
Extrinsic evaluation of N-gram models
Best evaluation for comparing models A and B
Put each model in a task
spelling corrector, speech recognizer, machine translation
system
Run the task, get an accuracy for A and for B
How many misspelled words corrected properly
How many words translated correctly
Compare accuracy for A and B
Difficulty of extrinsic evaluation
Extrinsic evaluation: time-consuming; can take days or weeks.
Intrinsic evaluation: perplexity.
A bad approximation unless the test data looks just like the training data,
so it is generally only useful in pilot experiments.
Intuition of Perplexity
How well can we predict the next word?
Unigrams provide the lowest accuracy
A better model is one which assigns a higher probability to the word that actually occurs
I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____
mushrooms 0.1
pepperoni 0.1
anchovies 0.01
….
fried rice 0.0001
….
and 1e-100
Perplexity
Perplexity is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w1 w2 … wN)^(−1/N) = the N-th root of 1 / P(w1 w2 … wN)
Chain rule: PP(W) = ( ∏_{i=1}^{N} 1 / P(wi | w1 … wi−1) )^(1/N)
For bigrams: PP(W) = ( ∏_{i=1}^{N} 1 / P(wi | wi−1) )^(1/N)
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence), i.e. the lowest perplexity
Perplexity
Measure of how well a model “fits” the test data.
Uses the probability that the model assigns to the test corpus.
Normalizes for the number of words in the test corpus and takes the inverse.
PP(W) = P(w1 w2 … wN)^(−1/N)
• Measures the weighted average branching factor in predicting the next word (lower is better).
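A minimal sketch of computing this quantity in the bigram form, working in log space to avoid underflow; the uniform toy model is an assumption chosen so that the expected perplexity is exactly 10:

```python
import math

# Sketch: perplexity of a test sequence under a bigram model,
# PP(W) = exp(-(1/N) * sum_i log P(w_i | w_{i-1})).
# Working in log space avoids numerical underflow on long texts.

def perplexity(test_words, bigram_prob):
    padded = ["<s>"] + test_words + ["</s>"]
    log_prob = 0.0
    n = 0
    for prev, cur in zip(padded, padded[1:]):
        log_prob += math.log(bigram_prob[(prev, cur)])
        n += 1
    return math.exp(-log_prob / n)

# Toy model: every conditional probability is 1/10, so PP should be 10.
uniform = {}
test = "a b c d e f g h i".split()
padded = ["<s>"] + test + ["</s>"]
for prev, cur in zip(padded, padded[1:]):
    uniform[(prev, cur)] = 0.1

print(perplexity(test, uniform))   # 10.0 (up to floating point)
```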
Perplexity as branching factor
Let’s suppose a sentence consisting of only ten words.
What is the perplexity of this sentence according to a model that assigns P = 1/10 to each word?
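Working it out: P(W) = (1/10)^10, so PP(W) = ((1/10)^10)^(−1/10) = 10; the perplexity equals the uniform branching factor of 10.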
Example
A random sentence consists of the following three words, which appear with the following probabilities. Using a unigram model, what is the perplexity of the sequence (he, loves, her)?
P(He) = 2/5
P(Loves) = 1/5
P(her) = 2/5
PP(he, loves, her) = (2/5 × 1/5 × 2/5)^(−1/3) ≈ 3.15
Lower perplexity = better model
Training: 38 million words; test: 1.5 million words (WSJ)
N-gram order:   Unigram   Bigram   Trigram
Perplexity:     962       170      109
Smoothing
Since there are a combinatorial number of possible word sequences, many rare (but not impossible) combinations never occur in training, so most models incorrectly assign zero probability to many parameters: the case of sparse data. A typical example is a morphologically rich language such as Amharic.
If a new combination occurs during testing, it is given a probability of zero and the entire sequence gets a probability of zero (i.e. infinite perplexity).
In practice, parameters are smoothed to reassign some probability mass to unseen events. Adding probability mass to unseen events requires removing
it from seen ones (discounting) in order to maintain a joint distribution that sums to 1.
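The slides discuss smoothing in general terms; as one simple concrete instance, here is a hedged sketch of add-one (Laplace) smoothing for bigrams, which moves a little probability mass from seen to unseen bigrams. The toy corpus and names are illustrative assumptions:

```python
from collections import Counter

# Sketch: add-one (Laplace) smoothed bigram estimate,
# P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V),
# where V is the vocabulary size.  Toy corpus for illustration only;
# the bigram across the sentence boundary is kept for simplicity.

corpus = ["<s>", "i", "am", "sam", "</s>", "<s>", "sam", "i", "am", "</s>"]
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
V = len(unigram_counts)

def laplace_bigram(prev, word):
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(laplace_bigram("i", "am"))     # seen bigram: gets a discounted estimate
print(laplace_bigram("am", "i"))     # unseen bigram: no longer zero
```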
Advanced Smoothing
There are various techniques developed to improve
smoothing for language models.
Good-Turing
Interpolation
Backoff
Class-based (cluster) N-grams
Model Combination
As N increases, the power (expressiveness) of an N-
gram model increases, but the ability to estimate
accurate parameters from sparse data decreases (i.e.
the smoothing problem gets worse).
A general approach is to combine the results of
multiple N-gram models of increasing complexity
Interpolation
Linearly combine estimates of N-gram models of
increasing order.
• Learn proper values for the λi by training to (approximately) maximize the likelihood of an independent development corpus.
• This is called tuning.
• Recall the model we built for tagging.
Interpolated Trigram Model:
P̂(wn | wn−2, wn−1) = λ1 P(wn | wn−2, wn−1) + λ2 P(wn | wn−1) + λ3 P(wn)
where Σ_i λi = 1
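A minimal sketch of the interpolated trigram estimate above; the λ values and the component probability tables are toy assumptions (in practice the λi are tuned on a held-out development corpus, as the slide notes):

```python
# Sketch: linear interpolation of trigram, bigram and unigram estimates,
# P̂(w_n | w_{n-2}, w_{n-1}) = λ1·P(w_n | w_{n-2}, w_{n-1})
#                            + λ2·P(w_n | w_{n-1}) + λ3·P(w_n).
# The lambdas and probability lookups below are illustrative assumptions.

LAMBDAS = (0.6, 0.3, 0.1)   # must sum to 1; tuned on a development set in practice

def interpolated_prob(w, prev2, prev1, tri, bi, uni):
    l1, l2, l3 = LAMBDAS
    return (l1 * tri.get((prev2, prev1, w), 0.0)
            + l2 * bi.get((prev1, w), 0.0)
            + l3 * uni.get(w, 0.0))

tri = {("your", "cell", "phone"): 0.5}
bi  = {("cell", "phone"): 0.4}
uni = {"phone": 0.01}

print(interpolated_prob("phone", "your", "cell", tri, bi, uni))
# 0.6*0.5 + 0.3*0.4 + 0.1*0.01 = 0.421
```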
Backoff
Only use lower-order model when data for higher-
order model is unavailable (i.e. count is zero).
Recursively back-off to weaker models until data is
available.
Start from the most restrictive and move down to the
relaxed model.
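A hedged sketch of the back-off idea: use the trigram estimate when its count is non-zero, otherwise fall back to the bigram, then the unigram. Proper Katz back-off also applies discounting and renormalization, which this sketch omits; the toy corpus is an assumption:

```python
from collections import Counter

# Sketch: recursive back-off from trigram to bigram to unigram counts.
# This is the bare "use a lower order only when the higher-order count
# is zero" idea; real Katz back-off adds discounting, omitted here.

corpus = "<s> <s> i am sam </s>".split()
tri_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bi_counts  = Counter(zip(corpus, corpus[1:]))
uni_counts = Counter(corpus)
N = sum(uni_counts.values())

def backoff_prob(w, prev2, prev1):
    if tri_counts[(prev2, prev1, w)] > 0:
        return tri_counts[(prev2, prev1, w)] / bi_counts[(prev2, prev1)]
    if bi_counts[(prev1, w)] > 0:
        return bi_counts[(prev1, w)] / uni_counts[prev1]
    return uni_counts[w] / N            # unigram as the last resort

print(backoff_prob("sam", "i", "am"))   # trigram "i am sam" was seen
print(backoff_prob("sam", "<s>", "i"))  # backs off to the unigram estimate
```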
A Problem for N-Grams:
Long Distance Dependencies
Many times local context does not provide the most useful predictive clues, which instead are provided by long-distance dependencies.
Syntactic dependencies:
“The man next to the large oak tree near the grocery store on the corner is tall.”
“The men next to the large oak tree near the grocery store on the corner are tall.”
Semantic dependencies:
“The bird next to the large oak tree near the grocery store on the corner flies rapidly.”
“The man next to the large oak tree near the grocery store on the corner talks rapidly.”
More complex models of language are needed to handle such dependencies.
The cases
Unigram model: probability of the occurrence of a word in a corpus.
Bigram model: probability of a word given the one preceding word.
Trigram model: probability of a word given the two preceding words.
N-gram model: probability of a word given the N−1 preceding words.
N-gram models
We can extend to trigrams, 4-grams, 5-grams
In general this is an insufficient model of
language
because language has long-distance dependencies:
“The computer which I had just put into the machine
room on the fifth floor crashed.”
But we can often get away with N-gram models
Estimating bigram probabilities
The Maximum Likelihood Estimate
P(wi | wi−1) = count(wi−1, wi) / count(wi−1)
An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
P(wi | wi−1) = c(wi−1, wi) / c(wi−1)
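A minimal sketch that applies this MLE formula to the three-sentence corpus above, counting bigrams and dividing by the count of the preceding word; variable names are illustrative:

```python
from collections import Counter

# Sketch: maximum-likelihood bigram estimates from the tiny corpus above,
# P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1}).

sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

bigram_counts = Counter()
unigram_counts = Counter()
for s in sentences:
    tokens = s.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p(word, prev):
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p("I", "<s>"))    # 2/3 ≈ 0.67
print(p("am", "I"))     # 2/3 ≈ 0.67
print(p("Sam", "am"))   # 1/2 = 0.5
print(p("</s>", "Sam")) # 1/2 = 0.5
```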
Raw bigram counts (out of 9222 sentences)
Raw bigram probabilities: normalize the bigram counts by the unigram counts
Bigram estimates of sentence
probabilities
P(<s> I want english food </s>) =
P(I|<s>)
× P(want|I)
× P(english|want)
× P(food|english)
× P(</s>|food)
= .000031
End of Topic 4