TOPIC 4: LANGUAGE MODELING
NATURAL LANGUAGE PROCESSING (NLP)
CS-724
Wondwossen Mulugeta (PhD), email: [email protected]
Topics
Topic 4: Language Modeling
Subtopics:
• Introduction to Language Modeling
• Applications
• The N-gram model
• Smoothing Techniques
• Model Evaluation
What is a Language Model
Language models assign a probability that a sentence is a legal string in a
language.
Having a way to estimate the relative likelihood of different phrases is useful
in many natural language processing applications.
• We can view a finite state automaton as a deterministic language model.
It can generate “I wish I wish I wish I wish . . .” but CANNOT generate “wish I wish” or “I wish I”.
Our basic model: each document was generated by a different automaton like this except that these automata are probabilistic.
Language Models
Formal grammars (e.g. regular, context free) give a hard “binary” model of the legal sentences in a language.
For NLP, a probabilistic model of a language that gives a probability that a string is a member of a language is more useful.
To specify a correct probability distribution, the probability of all sentences in a language must sum to 1.
Uses of Language Models
Speech recognition
“I ate a cherry” is a more likely sentence than “Eye eight uh Jerry”
OCR & Handwriting recognition
More probable sentences are more likely correct readings.
Machine translation
More likely sentences are probably better translations.
Generation
More likely sentences are probably better NL generations.
Context sensitive spelling correction
“Their are problems wit this sentence.”
Uses of Language Models
A language model also supports predicting the
completion of a sentence.
Please turn off your cell _____
Your program does not ______
Predictive text input systems can guess what you
are typing and give choices on how to complete it.
N-Gram Models
Estimate probability of each word given prior context.
P(phone | Please turn off your cell)
Number of parameters required grows exponentially with the number of words of prior context.
An N-gram model uses only N−1 words of prior context.
Unigram: P(phone)
Bigram: P(phone | cell)
Trigram: P(phone | your cell)
N-Gram Models
The Markov assumption is the presumption that the future behavior of a dynamical system depends only on its recent history. In particular, in a kth-order Markov model, the next state depends only on the k most recent states; therefore an N-gram model is an (N−1)th-order Markov model.
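A minimal sketch, assuming whitespace-tokenized input, of how N-grams (and hence their N−1 words of context) are extracted from a sentence; the function name and example sentence are illustrative, not from the slides:

```python
# Sketch: extract unigrams, bigrams and trigrams from a token list.
# Assumes simple whitespace tokenization; names are illustrative only.

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "please turn off your cell phone".split()

print(ngrams(tokens, 1))  # unigrams: ('please',), ('turn',), ...
print(ngrams(tokens, 2))  # bigrams:  ('please', 'turn'), ('turn', 'off'), ...
print(ngrams(tokens, 3))  # trigrams: ('please', 'turn', 'off'), ...
```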
A probabilistic language model
This is a one-state probabilistic finite-state automaton, also called a unigram language model, together with the state emission distribution for its single state q1.
STOP is not a word, but a special symbol indicating that the automaton stops.
Eg: “frog said that toad likes frog” STOP
P(string) = 0.01 × 0.03 × 0.04 × 0.01 × 0.02 × 0.01 × 0.2
= 0.0000000000048 ≈ 4.8 × 10⁻¹²
Probabilistic Language Modeling
Goal: compute the probability of a sentence or sequence of words:
P(W) = P(w1,w2,w3,w4,w5…wn)
Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1)
This is called a Language Model.
How to compute P(W):
The Chain Rule
How to compute this joint probability:
P(its, water, is, so, transparent, that)
Intuition: let’s rely on the Chain Rule of Probability
Recall the definition of conditional probabilities
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
P(this is a boy) = P(this) × P(is | this) × P(a | this is) × P(boy | this is a)
The Chain Rule applied to compute joint
probability of words in sentence
P(“its water is so transparent”) =
P(its) × P(water | its) × P( is| its water)
× P(so |its water is) × P(transparent | its water is so)
P(w1 w2 … wn) = ∏_i P(wi | w1 w2 … wi−1)
How to estimate these probabilities
Could we just count and divide?
Too many possible sentences!
We’ll never see enough data for estimating these
P(the | its water is so transparent that) =
Count(its water is so transparent that the) / Count(its water is so transparent that)
Markov Assumption
Simplifying assumption:
Or maybe
P(the | its water is so transparent that) ≈ P(the | that)
P(the | its water is so transparent that) ≈ P(the | transparent that)
Andrei Markov
Markov Assumption
In other words, we approximate each
component in the product
P(w1 w2 … wn) ≈ ∏_i P(wi | wi−k … wi−1)
P(wi | w1 w2 … wi−1) ≈ P(wi | wi−k … wi−1)
N-Gram Model Formulas
Word sequences: w1^n = w1 … wn
Chain rule of probability:
P(w1^n) = P(w1) P(w2 | w1) P(w3 | w1^2) … P(wn | w1^(n−1)) = ∏_{k=1}^{n} P(wk | w1^(k−1))
Bigram approximation:
P(w1^n) ≈ ∏_{k=1}^{n} P(wk | wk−1)
N-gram approximation:
P(w1^n) ≈ ∏_{k=1}^{n} P(wk | w(k−N+1)^(k−1))
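As a concrete illustration of the bigram approximation above, the following sketch multiplies P(wk | wk−1) over a padded sentence; the probability table is a toy assumption, not data from the slides:

```python
# Sketch: score a sentence with the bigram approximation
# P(w_1..w_n) ≈ ∏_k P(w_k | w_{k-1}).  The probabilities below are
# made-up toy values used only to show the mechanics.

toy_bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "food"): 0.10,
    ("food", "</s>"): 0.40,
}

def bigram_sentence_prob(words, probs):
    """Multiply conditional bigram probabilities over the padded sentence."""
    padded = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):
        p *= probs.get((prev, cur), 0.0)  # unseen bigrams get zero (see Smoothing)
    return p

print(bigram_sentence_prob(["i", "want", "food"], toy_bigram_prob))
# 0.25 * 0.33 * 0.10 * 0.40 = 0.0033
```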
Estimating Probabilities
N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences.
To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every sentence and treat these as additional words.
Bigram: P(wn | wn−1) = C(wn−1 wn) / C(wn−1)
N-gram: P(wn | w(n−N+1)^(n−1)) = C(w(n−N+1)^(n−1) wn) / C(w(n−N+1)^(n−1))
Generative Model & MLE
An N-gram model can be seen as a probabilistic
automata for generating sentences.
Relative frequency estimates can be proven to be
maximum likelihood estimates (MLE) since they
maximize the probability that the model M will
generate the training corpus T.
Initialize the sentence with N−1 <s> symbols.
Until </s> is generated do:
Stochastically pick the next word based on the conditional probability of each word given the previous N−1 words.
MLE: θ̂ = argmax_θ P(T | M(θ))
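A minimal sketch of the generation loop above for a bigram model (N = 2); the toy conditional distributions are assumptions used only for illustration:

```python
import random

# Sketch: generate a sentence from a bigram model by repeatedly sampling
# the next word given the previous word, until </s> is produced.
# The toy conditional distributions below are illustrative assumptions.

toy_model = {
    "<s>":  {"i": 0.6, "sam": 0.4},
    "i":    {"am": 0.7, "do": 0.3},
    "am":   {"sam": 0.5, "</s>": 0.5},
    "do":   {"</s>": 1.0},
    "sam":  {"</s>": 1.0},
}

def generate(model, max_len=20):
    sentence = ["<s>"]                      # N-1 = 1 start symbol for a bigram model
    while sentence[-1] != "</s>" and len(sentence) < max_len:
        dist = model[sentence[-1]]          # P(next word | previous word)
        words = list(dist.keys())
        weights = list(dist.values())
        sentence.append(random.choices(words, weights=weights)[0])
    return " ".join(sentence)

print(generate(toy_model))
```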
Train and Test Corpora
A language model must be trained on a large corpus of text to estimate good parameter values.
Model can be evaluated based on its ability to predict a high probability for a disjoint test corpus (testing on the training corpus would give an optimistically biased estimate).
Ideally, the training (and test) corpus should be representative of the actual application data.
Unknown Words
How to handle words in the test corpus that did not occur in the training data, i.e. out of vocabulary (OOV) words?
Train a model that includes an explicit symbol for an unknown word (<UNK>).
Choose a vocabulary in advance and replace other words in the training corpus with <UNK>.
Replace the first occurrence of each word in the training data with <UNK>.
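A hedged sketch of the second strategy (choose a vocabulary in advance and map everything else to <UNK>); the frequency cutoff and toy corpus are illustrative assumptions:

```python
from collections import Counter

# Sketch: build a closed vocabulary from training tokens and map
# out-of-vocabulary words to <UNK>.  The min_count threshold is an
# illustrative assumption.

def build_vocab(tokens, min_count=2):
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count}

def replace_oov(tokens, vocab):
    return [w if w in vocab else "<UNK>" for w in tokens]

train = "the cat sat on the mat the cat slept".split()
vocab = build_vocab(train)                       # {'the', 'cat'}
test  = "the dog sat".split()
print(replace_oov(test, vocab))                  # ['the', '<UNK>', '<UNK>']
```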
Evaluation of Language Models
Ideally, evaluate the use of the model in an end user's application
(extrinsic)
Realistic
Expensive
Evaluate on ability to model test corpus (intrinsic).
Less realistic
Cheaper
Verify at least once that intrinsic evaluation correlates
with an extrinsic one.
Extrinsic evaluation of N-gram models
Best evaluation for comparing models A and B
Put each model in a task
spelling corrector, speech recognizer, machine translation
system
Run the task, get an accuracy for A and for B
How many misspelled words corrected properly
How many words translated correctly
Compare accuracy for A and B
Difficulty of extrinsic evaluation
Extrinsic evaluation: time-consuming; can take days or weeks.
Intrinsic evaluation: perplexity.
A bad approximation unless the test data looks just like the training data,
so it is generally only useful in pilot experiments.
Intuition of Perplexity
How well can we predict the next word?
Unigrams provide the lowest accuracy
A better model is one which assigns a higher probability to the word that actually occurs
I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____
mushrooms 0.1
pepperoni 0.1
anchovies 0.01
….
fried rice 0.0001
….
and 1e-100
Perplexity
Perplexity is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w1 w2 … wN)^(−1/N) = the N-th root of 1 / P(w1 w2 … wN)
Chain rule: PP(W) = ( ∏_{i=1}^{N} 1 / P(wi | w1 … wi−1) )^(1/N)
For bigrams: PP(W) = ( ∏_{i=1}^{N} 1 / P(wi | wi−1) )^(1/N)
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence), i.e. the lowest perplexity
Perplexity
Measure of how well a model “fits” the test data.
Uses the probability that the model assigns to the test corpus.
Normalizes for the number of words in the test corpus and takes the inverse.
PP(W) = P(w1 w2 … wN)^(−1/N)
• Measures the weighted average branching factor in predicting the next word (lower is better).
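A minimal sketch of computing this quantity in the bigram form, working in log space to avoid underflow; the uniform toy model is an assumption chosen so that the expected perplexity is exactly 10:

```python
import math

# Sketch: perplexity of a test sequence under a bigram model,
# PP(W) = exp(-(1/N) * sum_i log P(w_i | w_{i-1})).
# Working in log space avoids numerical underflow on long texts.

def perplexity(test_words, bigram_prob):
    padded = ["<s>"] + test_words + ["</s>"]
    log_prob = 0.0
    n = 0
    for prev, cur in zip(padded, padded[1:]):
        log_prob += math.log(bigram_prob[(prev, cur)])
        n += 1
    return math.exp(-log_prob / n)

# Toy model: every conditional probability is 1/10, so PP should be 10.
uniform = {}
test = "a b c d e f g h i".split()
padded = ["<s>"] + test + ["</s>"]
for prev, cur in zip(padded, padded[1:]):
    uniform[(prev, cur)] = 0.1

print(perplexity(test, uniform))   # 10.0 (up to floating point)
```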
Perplexity as branching factor
Let’s suppose a sentence consisting of only ten words.
What is the perplexity of this sentence according to a model that assigns P = 1/10 to each word?
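Working it out: P(W) = (1/10)^10, so PP(W) = ((1/10)^10)^(−1/10) = 10; the perplexity equals the uniform branching factor of 10.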
Example
A random sentence consists of the following three words, which appear with the following probabilities. Using a unigram model, what is the perplexity of the sequence (he, loves, her)?
P(He) = 2/5
P(Loves) = 1/5
P(her) = 2/5
PP(he, loves, her) = (2/5 × 1/5 × 2/5)^(−1/3) ≈ 3.15
Lower perplexity = better model
Training: 38 million words; test: 1.5 million words (WSJ)
N-gram order:   Unigram   Bigram   Trigram
Perplexity:     962       170      109
Smoothing
Since there are a combinatorial number of possible word sequences, many rare (but not impossible) combinations never occur in training, so most models incorrectly assign zero probability to many parameters: the case of sparse data. A typical example is a morphologically rich language such as Amharic.
If a new combination occurs during testing, it is given a probability of zero and the entire sequence gets a probability of zero (i.e. infinite perplexity).
In practice, parameters are smoothed to reassign some probability mass to unseen events. Adding probability mass to unseen events requires removing
it from seen ones (discounting) in order to maintain a joint distribution that sums to 1.
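The slides discuss smoothing in general terms; as one simple concrete instance, here is a hedged sketch of add-one (Laplace) smoothing for bigrams, which moves a little probability mass from seen to unseen bigrams. The toy corpus and names are illustrative assumptions:

```python
from collections import Counter

# Sketch: add-one (Laplace) smoothed bigram estimate,
# P(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V),
# where V is the vocabulary size.  Toy corpus for illustration only;
# the bigram across the sentence boundary is kept for simplicity.

corpus = ["<s>", "i", "am", "sam", "</s>", "<s>", "sam", "i", "am", "</s>"]
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
V = len(unigram_counts)

def laplace_bigram(prev, word):
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(laplace_bigram("i", "am"))     # seen bigram: gets a discounted estimate
print(laplace_bigram("am", "i"))     # unseen bigram: no longer zero
```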
Advanced Smoothing
There are various techniques developed to improve
smoothing for language models.
Good-Turing
Interpolation
Backoff
Class-based (cluster) N-grams
Model Combination
As N increases, the power (expressiveness) of an N-
gram model increases, but the ability to estimate
accurate parameters from sparse data decreases (i.e.
the smoothing problem gets worse).
A general approach is to combine the results of
multiple N-gram models of increasing complexity
Interpolation
Linearly combine estimates of N-gram models of
increasing order.
• Learn proper values for the λi by training to (approximately) maximize the likelihood of an independent development corpus.
• This is called tuning.
• Recall the model we built for tagging.
Interpolated Trigram Model:
P̂(wn | wn−2, wn−1) = λ1 P(wn | wn−2, wn−1) + λ2 P(wn | wn−1) + λ3 P(wn)
where Σ_i λi = 1
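A minimal sketch of the interpolated trigram estimate above; the λ values and the component probability tables are toy assumptions (in practice the λi are tuned on a held-out development corpus, as the slide notes):

```python
# Sketch: linear interpolation of trigram, bigram and unigram estimates,
# P̂(w_n | w_{n-2}, w_{n-1}) = λ1·P(w_n | w_{n-2}, w_{n-1})
#                            + λ2·P(w_n | w_{n-1}) + λ3·P(w_n).
# The lambdas and probability lookups below are illustrative assumptions.

LAMBDAS = (0.6, 0.3, 0.1)   # must sum to 1; tuned on a development set in practice

def interpolated_prob(w, prev2, prev1, tri, bi, uni):
    l1, l2, l3 = LAMBDAS
    return (l1 * tri.get((prev2, prev1, w), 0.0)
            + l2 * bi.get((prev1, w), 0.0)
            + l3 * uni.get(w, 0.0))

tri = {("your", "cell", "phone"): 0.5}
bi  = {("cell", "phone"): 0.4}
uni = {"phone": 0.01}

print(interpolated_prob("phone", "your", "cell", tri, bi, uni))
# 0.6*0.5 + 0.3*0.4 + 0.1*0.01 = 0.421
```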
Backoff
Only use lower-order model when data for higher-
order model is unavailable (i.e. count is zero).
Recursively back-off to weaker models until data is
available.
Start from the most restrictive and move down to the
relaxed model.
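A hedged sketch of the back-off idea: use the trigram estimate when its count is non-zero, otherwise fall back to the bigram, then the unigram. Proper Katz back-off also applies discounting and renormalization, which this sketch omits; the toy corpus is an assumption:

```python
from collections import Counter

# Sketch: recursive back-off from trigram to bigram to unigram counts.
# This is the bare "use a lower order only when the higher-order count
# is zero" idea; real Katz back-off adds discounting, omitted here.

corpus = "<s> <s> i am sam </s>".split()
tri_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bi_counts  = Counter(zip(corpus, corpus[1:]))
uni_counts = Counter(corpus)
N = sum(uni_counts.values())

def backoff_prob(w, prev2, prev1):
    if tri_counts[(prev2, prev1, w)] > 0:
        return tri_counts[(prev2, prev1, w)] / bi_counts[(prev2, prev1)]
    if bi_counts[(prev1, w)] > 0:
        return bi_counts[(prev1, w)] / uni_counts[prev1]
    return uni_counts[w] / N            # unigram as the last resort

print(backoff_prob("sam", "i", "am"))   # trigram "i am sam" was seen
print(backoff_prob("sam", "<s>", "i"))  # backs off to the unigram estimate
```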
A Problem for N-Grams:
Long Distance Dependencies
Many times local context does not provide the most useful predictive clues, which instead are provided by long-distance dependencies.
Syntactic dependencies:
“The man next to the large oak tree near the grocery store on the corner is tall.”
“The men next to the large oak tree near the grocery store on the corner are tall.”
Semantic dependencies:
“The bird next to the large oak tree near the grocery store on the corner flies rapidly.”
“The man next to the large oak tree near the grocery store on the corner talks rapidly.”
More complex models of language are needed to handle such dependencies.
The cases
Unigram model: probability of the occurrence of a word in a corpus.
Bigram model: probability of a word given the one preceding word.
Trigram model: probability of a word given the two preceding words.
N-gram model: probability of a word given the N−1 preceding words.
N-gram models
We can extend to trigrams, 4-grams, 5-grams
In general this is an insufficient model of
language
because language has long-distance dependencies:
“The computer which I had just put into the machine
room on the fifth floor crashed.”
But we can often get away with N-gram models
Estimating bigram probabilities
The Maximum Likelihood Estimate
P(wi | wi−1) = count(wi−1, wi) / count(wi−1)
An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
P(wi | wi−1) = c(wi−1, wi) / c(wi−1)
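A minimal sketch that applies this MLE formula to the three-sentence corpus above, counting bigrams and dividing by the count of the preceding word; variable names are illustrative:

```python
from collections import Counter

# Sketch: maximum-likelihood bigram estimates from the tiny corpus above,
# P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1}).

sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

bigram_counts = Counter()
unigram_counts = Counter()
for s in sentences:
    tokens = s.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p(word, prev):
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p("I", "<s>"))    # 2/3 ≈ 0.67
print(p("am", "I"))     # 2/3 ≈ 0.67
print(p("Sam", "am"))   # 1/2 = 0.5
print(p("</s>", "Sam")) # 1/2 = 0.5
```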
Raw bigram counts (out of 9222 sentences)
Raw bigram probabilities: normalize the bigram counts by the unigram counts
Bigram estimates of sentence
probabilities
P(<s> I want english food </s>) =
P(I|<s>)
× P(want|I)
× P(english|want)
× P(food|english)
× P(</s>|food)
= .000031
End of Topic 4