NLP Language Models: Language Models (LM), Noisy Channel Model, Simple Markov Models, Smoothing, Statistical Language Models

NLP Language Models 1

• Language Models, LM
• Noisy Channel model
• Simple Markov Models
• Smoothing

Statistical Language Models

NLP Language Models 2

Two Main Approaches to NLP

• Knowledge-based (AI)

• Statistical models
- Inspired by speech recognition: the probability of the next word is estimated from the previous words
- Statistical models are different from statistical methods

NLP Language Models 3

Probability Theory

• Let X be the uncertain outcome of some event; X is called a random variable.

• V(X) is the finite set of possible outcomes (not necessarily real numbers).

• P(X = x) is the probability of the particular outcome x, where x belongs to V(X).

• Example: X is the disease of your patient and V(X) is the set of all possible diseases.

NLP Language Models 4

Conditional probability: the probability of the outcome of one event given the outcome of a second event.

Example: we pick two consecutive words at random from a book. We know the first word is "the" and we want the probability that the second word is "dog":

P(W2 = dog | W1 = the) = |W1 = the, W2 = dog| / |W1 = the|

Bayes’s law: P(x|y) = P(x) P(y|x) / P(y)

Probability Theory

NLP Language Models 5

Bayes’s law: P(x|y) = P(x) P(y|x) / P(y)

P(disease | symptom) = P(disease) P(symptom | disease) / P(symptom)

P(w1,n| speech signal) = P(w1,n)P(speech signal | w1,n)/ P(speech signal)

Since P(speech signal) does not depend on the candidate word sequence, we only need to maximize the numerator.

P(speech signal | w1,n) expresses how well the speech signal fits the sequence of words w1,n.

Probability Theory

NLP Language Models 6

Useful generalizations of Bayes's law

- To find the probability of some event, compute the probability that it happens given some second event, times the probability of that second event.

- P(w,x | y,z) = P(w,x) P(y,z | w,x) / P(y,z)

where w, x, y, z are separate events (e.g., each takes a word as its value)

- P(w1,w2,...,wn) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1,...,wn-1)

The same holds when conditioned on some event x:

P(w1,w2,...,wn | x) = P(w1|x) P(w2|w1,x) ... P(wn|w1,...,wn-1,x)

Probability Theory

NLP Language Models 7

Statistical Model of a Language

• Statistical models of sequences of words (sentences): language models

• They assign a probability to every possible sequence of words.

• For sequences of words of length n, assign a number to P(W1,n = w1,n), where w1,n is a sequence of n random variables W1,W2,...,Wn, each taking any word as its value.

• Computing the probability of the next word is related to computing the probability of word sequences:

P(the big dog) = P(the) P(big | the) P(dog | the big)
P(the pig dog) = P(the) P(pig | the) P(dog | the pig)
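A minimal sketch of this chain-rule computation in Python, using small invented probability tables (the numbers and names are illustrative only, not estimated from any corpus):

def chain_rule_probability(sentence, unigram, conditional):
    # P(w1..wn) = P(w1) * P(w2 | w1) * ... * P(wn | w1..wn-1)
    words = sentence.split()
    prob = unigram.get(words[0], 0.0)
    for i in range(1, len(words)):
        history = tuple(words[:i])                      # full history w1..w(i-1)
        prob *= conditional.get((history, words[i]), 0.0)
    return prob

# Invented probabilities for the slide's example
unigram = {"the": 0.06}
conditional = {
    (("the",), "big"): 0.01,
    (("the", "big"), "dog"): 0.02,
}
print(chain_rule_probability("the big dog", unigram, conditional))   # 0.06 * 0.01 * 0.02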

NLP Language Models 8

Ngram Model

• Simple but durable statistical model

• Useful to identify words in noisy, ambiguous input.

• Speech recognition: many input speech sounds are similar and confusable.

• Machine translation, spelling correction, handwriting recognition, predictive text input

• Other NLP tasks: part of speech tagging, NL generation, word similarity

NLP Language Models 9

CORPORA

• Corpora (singular corpus) are online collections of text or speech.

• Brown Corpus: a 1 million word collection from 500 written texts from different genres (newspaper, novels, academic).

• Punctuation can be treated as words.

• Switchboard corpus: 2430 telephone conversations averaging 6 minutes each, about 240 hours of speech and 3 million words.

There is no punctuation.

There are disfluencies: filled pauses and fragments

“I do uh main- mainly business data processing”

NLP Language Models 10

Training and Test Sets

• The probabilities of an N-gram model come from the corpus it is trained on.

• The data in the corpus is divided into a training set (or training corpus) and a test set (or test corpus).

• Perplexity is used to compare statistical models on the test set.
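A minimal sketch of how perplexity can be computed on a test set, assuming a hypothetical model_probability(history, word) function that returns a smoothed (non-zero) probability:

import math

def perplexity(test_words, model_probability):
    # Perplexity = 2 ** ( -(1/N) * sum_i log2 P(w_i | w_1..w_{i-1}) )
    log_sum = 0.0
    for i, word in enumerate(test_words):
        p = model_probability(test_words[:i], word)   # P(w_i | history)
        log_sum += math.log2(p)                       # requires p > 0, hence smoothing
    return 2 ** (-log_sum / len(test_words))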

NLP Language Models 11

Ngram Model

• How can we compute the probabilities of entire sequences P(w1,w2,...,wn)?

• Decomposition using the chain rule of probability:

P(w1,w2,...,wn) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1,...,wn-1)

• Assigns a conditional probability to possible next words considering the history.

• Markov assumption : we can predict the probability of some future unit without looking too far into the past.

• Bigrams consider only the previous unit, trigrams the two previous units, and n-grams the n-1 previous units.

NLP Language Models 12

Ngram Model

• Assigns a conditional probability to possible next words. Only the n-1 previous words have an effect on the probability of the next word.

• For n = 3 (trigrams): P(wn | w1,...,wn-1) = P(wn | wn-2, wn-1)

• How do we estimate these trigram or N-gram probabilities?

Maximum Likelihood Estimation (MLE)

MLE chooses the parameters that maximize the likelihood of the training set T given the model M, i.e. P(T|M).

• To create the model, use a training text (corpus), take counts and normalize them so they lie between 0 and 1.

NLP Language Models 13

Ngram Model

• For n = 3 (trigrams):

P(wn | w1,...,wn-1) = P(wn | wn-2, wn-1)

• To create the model, use the training text and record which pairs and triples of words appear in the text and how many times they occur.

P(wi | wi-2, wi-1) = C(wi-2 wi-1 wi) / C(wi-2 wi-1)

P(submarine|the, yellow) = C(the,yellow, submarine)/C(the,yellow)

Relative frequency: the observed frequency of a particular sequence divided by the observed frequency of its prefix.
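A minimal sketch of this relative-frequency (MLE) estimation for trigrams, assuming a whitespace-tokenized training text (the helper names and the toy corpus are illustrative):

from collections import Counter

def train_trigram_mle(tokens):
    # P(w_i | w_{i-2}, w_{i-1}) = C(w_{i-2} w_{i-1} w_i) / C(w_{i-2} w_{i-1})
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def prob(w2, w1, w):
        denom = bigram_counts[(w2, w1)]
        return trigram_counts[(w2, w1, w)] / denom if denom else 0.0

    return prob

tokens = "the yellow submarine sails the yellow sea".split()
p = train_trigram_mle(tokens)
print(p("the", "yellow", "submarine"))   # C(the yellow submarine) / C(the yellow) = 1/2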

NLP Language Models 14

• Statistical Models
• Language Models (LM)
• Vocabulary (V), word w ∈ V
• Language (L), sentence s ∈ L
• L ⊆ V*, usually infinite
• s = w1,...,wN
• Probability of s: P(s)

Language Models

NLP Language Models 15

[Diagram] message W → encoder → X (input to channel) → channel p(y|x) → Y (output from channel) → decoder → W* (attempt to reconstruct the message based on the output)

Noisy Channel Model

NLP Language Models 16

• In NLP we do not usually act on the encoding. The problem reduces to decoding: finding the most likely input given the output:

Î = argmax_i P(i) P(o|i)

[Diagram] I → noisy channel p(o|i) → O → decoder → Î

Noisy Channel Model in NLP

NLP Language Models 17

Î = argmax_i P(i) P(o|i)

P(i): language model      P(o|i): channel model
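A minimal sketch of noisy-channel decoding over a small candidate set, with invented language-model and channel probabilities (all names and numbers are illustrative):

def noisy_channel_decode(observed, candidates, lm_prob, channel_prob):
    # Return the candidate i maximizing P(i) * P(o | i)
    return max(candidates, key=lambda i: lm_prob(i) * channel_prob(observed, i))

# Toy spelling-correction example with invented probabilities
candidates = ["the", "thee", "then"]
lm = {"the": 0.05, "thee": 0.0001, "then": 0.01}
channel = {("teh", "the"): 0.3, ("teh", "thee"): 0.05, ("teh", "then"): 0.02}

best = noisy_channel_decode(
    "teh", candidates,
    lm_prob=lambda i: lm[i],
    channel_prob=lambda o, i: channel.get((o, i), 0.0),
)
print(best)   # "the": 0.05 * 0.3 gives the largest product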

NLP Language Models 18

[Diagram] real language X → noisy channel → observed language Y

We want to retrieve X from Y.

NLP Language Models 19

[Diagram] real language X: correct text → noisy channel: errors → observed language Y: text with errors

NLP Language Models 20

[Diagram] real language X: correct text → noisy channel: space removal → observed language Y: text without spaces

NLP Language Models 21

[Diagram] real language X: text (language model) → noisy channel (e.g. spelling, acoustic model) → observed language Y: speech

NLP Language Models 22

[Diagram] real language X: source language → noisy channel: translation → observed language Y: target language

NLP Language Models 23

Example: ASR (Automatic Speech Recognizer)

[Diagram] acoustic chain X1 ... XT → ASR → word chain w1 ... wN, combining a language model and an acoustic model

NLP Language Models 24

Example: Machine Translation

[Diagram] combines a target language model and a translation model

NLP Language Models 25

• Naive implementation
- Enumerate s ∈ L
- Compute P(s)
- Parameters of the model: |L|

• But...
- L is usually infinite
- How to estimate the parameters?

• Simplifications
- History: hi = {w1, ..., wi-1}
- Markov Models

NLP Language Models 26

• Markov Model of order n-1: the n-gram model

• P(wi|hi) = P(wi|wi-n+1, ..., wi-1)

• 0-gram

• 1-gram: P(wi|hi) = P(wi)

• 2-gram: P(wi|hi) = P(wi|wi-1)

• 3-gram: P(wi|hi) = P(wi|wi-2,wi-1)

NLP Language Models 27

• n large: more context information (more discriminative power)

• n small: more cases in the training corpus (more reliable estimates)

• Selecting n: e.g. for |V| = 20,000

n             num. parameters
2 (bigrams)   400,000,000
3 (trigrams)  8,000,000,000,000
4 (4-grams)   1.6 x 10^17
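These figures are simply |V|^n, the number of n-gram probabilities the model must store; a quick check:

V = 20_000
for n in (2, 3, 4):
    print(n, V ** n)   # 400000000, 8000000000000, 160000000000000000 (= 1.6 x 10^17)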

NLP Language Models 28

• Parameters of an n-gram model: |V|^n

• MLE estimation from a training corpus

• Problem of sparseness

NLP Language Models 29

• 1-gram Model: P_MLE(w) = C(w) / N   (N = total number of words in the training corpus)

• 2-gram Model: P_MLE(wi | wi-1) = C(wi-1 wi) / C(wi-1)

• 3-gram Model: P_MLE(wi | wi-2, wi-1) = C(wi-2 wi-1 wi) / C(wi-2 wi-1)

NLP Language Models 32

True probability distribution

NLP Language Models 33

The seen cases are overestimated, while the unseen ones get a null probability.

NLP Language Models 34

Reserve part of the probability mass from the seen cases and assign it to the unseen ones.

SMOOTHING

NLP Language Models 35

• Some methods operate on the counts:
- Laplace, Lidstone, Jeffreys-Perks

• Some methods operate on the probabilities:
- Held-Out
- Good-Turing
- Discounting

• Some methods combine models:
- Linear interpolation
- Back-off

NLP Language Models 36

P_Laplace(w1...wn) = (C(w1...wn) + 1) / (N + B)

P = probability of an n-gram

C = count of the n-gram in the training corpus

N = total number of n-grams in the training corpus

B = number of parameters of the model (number of possible n-grams)

Laplace (add 1)

NLP Language Models 37

P_Lid(w1...wn) = (C(w1...wn) + λ) / (N + B·λ)

λ = a small positive number

MLE: λ = 0
Laplace: λ = 1
Jeffreys-Perks: λ = 1/2

Lidstone (generalization of Laplace)
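A minimal sketch of Lidstone smoothing for bigrams (Laplace when λ = 1, Jeffreys-Perks when λ = 0.5); the toy corpus and helper names are illustrative:

from collections import Counter

def lidstone_probability(ngram, ngram_counts, N, B, lam=1.0):
    # P_Lid(w1..wn) = (C(w1..wn) + lambda) / (N + B * lambda)
    return (ngram_counts[ngram] + lam) / (N + B * lam)

tokens = "the yellow submarine the yellow sea".split()
bigrams = list(zip(tokens, tokens[1:]))
counts = Counter(bigrams)
N = len(bigrams)              # total bigrams in the training corpus
V = len(set(tokens))
B = V * V                     # number of possible bigrams

print(lidstone_probability(("the", "yellow"), counts, N, B, lam=1.0))   # seen bigram
print(lidstone_probability(("sea", "the"), counts, N, B, lam=1.0))      # unseen bigram, still > 0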

NLP Language Models 38

• Compute the percentage of the probability mass that has to be reserved for the n-grams unseen in the training corpus

• We set aside from the training corpus a held-out corpus.

• We compute how many n-grams unseen in the training corpus occur in the held-out corpus.

• An alternative to using a held-out corpus is cross-validation:
- Held-out interpolation
- Deleted interpolation

Held-Out

NLP Language Models 39

Let w1...wn be an n-gram and r = C(w1...wn)

C1(w1...wn) = count of the n-gram in the training set
C2(w1...wn) = count of the n-gram in the held-out set
Nr = number of n-grams with count r in the training set

Tr = Σ { C2(w1...wn) : C1(w1...wn) = r }

P_ho(w1...wn) = Tr / (Nr · N)

Held-Out
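A minimal sketch of the held-out estimator for bigrams, taking N as the total number of bigrams in the held-out set (the slide leaves N implicit); the data handling and the treatment of r = 0 below are simplifications:

from collections import Counter

def held_out_bigram_probs(train_tokens, heldout_tokens):
    # P_ho(bigram) = T_r / (N_r * N), where r = training count of the bigram
    c1 = Counter(zip(train_tokens, train_tokens[1:]))        # counts in the training set
    c2 = Counter(zip(heldout_tokens, heldout_tokens[1:]))    # counts in the held-out set
    N = sum(c2.values())                                     # total bigrams in the held-out set

    t_r = Counter()    # T_r: held-out mass of bigrams whose training count is r
    n_r = Counter()    # N_r: number of bigram types with training count r
    for bigram in set(c1) | set(c2):
        r = c1[bigram]
        t_r[r] += c2[bigram]
        n_r[r] += 1    # for r = 0 this only counts bigrams seen in the held-out set,
                       # not all possible unseen bigrams

    def prob(bigram):
        r = c1[bigram]
        return t_r[r] / (n_r[r] * N) if n_r[r] and N else 0.0

    return prob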

NLP Language Models 40

r* = adjusted count
Nr = number of n-gram types occurring r times
E(Nr) = expected value of Nr
E(Nr+1) < E(Nr)   (Zipf's law)

r* = (r + 1) · E(Nr+1) / E(Nr)

P_GT = r* / N

Good-Turing
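A minimal sketch of Good-Turing adjusted counts for bigrams, using the observed N_r values in place of the expectations E(N_r) (a common simplification); the toy data are illustrative:

from collections import Counter

def good_turing_adjusted_counts(tokens):
    # r* = (r + 1) * N_{r+1} / N_r, with P_GT = r* / N
    counts = Counter(zip(tokens, tokens[1:]))
    N = sum(counts.values())                         # total bigram tokens
    n_r = Counter(counts.values())                   # N_r: number of bigram types seen r times

    adjusted = {}
    for bigram, r in counts.items():
        if n_r[r + 1] > 0:
            adjusted[bigram] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[bigram] = float(r)              # fall back to r when N_{r+1} is empty
    return adjusted, N

adjusted, N = good_turing_adjusted_counts("the yellow submarine the yellow sea the dog".split())
print({bg: round(rstar / N, 3) for bg, rstar in adjusted.items()})   # P_GT = r* / N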

NLP Language Models 41

Combination of models: linear combination (interpolation)

• Linear combination of the 1-gram, 2-gram, 3-gram, ... models
• Estimation of the λi weights using a development corpus

P_li(wn | wn-2, wn-1) = λ1 P1(wn) + λ2 P2(wn | wn-1) + λ3 P3(wn | wn-2, wn-1)
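A minimal sketch of the interpolated trigram estimate, assuming the three component estimators and the λ weights (which should sum to 1 and be tuned on a development corpus) are already available; all names are illustrative:

def interpolated_prob(w, w_prev1, w_prev2, p1, p2, p3, lambdas=(0.1, 0.3, 0.6)):
    # P_li(w | w_prev2, w_prev1) = l1*P1(w) + l2*P2(w | w_prev1) + l3*P3(w | w_prev2, w_prev1)
    l1, l2, l3 = lambdas
    return l1 * p1(w) + l2 * p2(w, w_prev1) + l3 * p3(w, w_prev1, w_prev2)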

NLP Language Models 42

• Start with an n-gram model
• Back off to the (n-1)-gram for null (or low) counts
• Proceed recursively

Katz’s Backing-Off
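A simplified sketch of the recursive back-off idea (not full Katz back-off, which also discounts the higher-order counts and applies back-off weights so that the result remains a proper probability distribution); the counts layout is an assumption for illustration:

def backoff_prob(ngram, counts, k=0):
    # counts: dict mapping n-gram order -> Counter of word tuples, e.g. counts[3][(w2, w1, w)]
    # Use the n-gram estimate if its count exceeds k, otherwise back off to the (n-1)-gram.
    history = ngram[:-1]
    if counts[len(ngram)].get(ngram, 0) > k or len(ngram) == 1:
        numer = counts[len(ngram)].get(ngram, 0)
        denom = counts[len(ngram) - 1].get(history, 0) if history else sum(counts[1].values())
        return numer / denom if denom else 0.0
    return backoff_prob(ngram[1:], counts, k)        # drop the oldest history word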

NLP Language Models 43

• Performing on the history

• Class-based Models
- Clustering (or classifying) words into classes
- POS, syntactic, semantic classes

• Rosenfeld, 2000:
- P(wi|wi-2,wi-1) = P(wi|Ci) P(Ci|wi-2,wi-1)
- P(wi|wi-2,wi-1) = P(wi|Ci) P(Ci|wi-2,Ci-1)
- P(wi|wi-2,wi-1) = P(wi|Ci) P(Ci|Ci-2,Ci-1)
- P(wi|wi-2,wi-1) = P(wi|Ci-2,Ci-1)
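A minimal sketch of the third decomposition above, P(wi|wi-2,wi-1) = P(wi|Ci) P(Ci|Ci-2,Ci-1), assuming a word-to-class mapping and the two component distributions have already been estimated (all names are illustrative):

def class_based_trigram_prob(w, w_prev1, w_prev2, word_class, p_word_given_class, p_class_trigram):
    # P(w | w_prev2, w_prev1) = P(w | C(w)) * P(C(w) | C(w_prev2), C(w_prev1))
    c, c1, c2 = word_class[w], word_class[w_prev1], word_class[w_prev2]
    return p_word_given_class(w, c) * p_class_trigram(c, c1, c2)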

NLP Language Models 44

• Structured Language Models
- Jelinek and Chelba, 1999
- Including the syntactic structure into the history

• Ti are the syntactic structures: binarized, lexicalized trees