Speech & NLP (Fall 2014): N-Grams, N-Gram Computation, Word Sequence Probabilities, N-Gram Smoothing, Markov Models

Page 1

Speech & NLP

www.vkedco.blogspot.com

N-Grams, N-Gram Computation, N-Gram Smoothing, Word Sequence Probabilities, N-Gram Vectors, Markov Models

Vladimir Kulyukin

Page 2

Outline

● N-Grams

● N-Gram Computation

● N-Gram Smoothing

● Markov & Hidden Markov Models (HMMs)

Page 3

N-Grams

Page 4

Introduction

● Word prediction is a fundamental task in spell checking, speech recognition, augmentative communication, and many other areas of NLP

● Word prediction algorithms are typically trained on various text corpora

● An N-Gram is a word prediction model that uses the previous N-1 words to predict the next word

● In statistical NLP, an N-Gram model is called a language model (LM) or grammar

Page 5

Word Prediction Examples

● See if you can predict the next word:

– It happened a long time …

– She wants to make a collect phone …

– I need to open a bank …

– Nutrition labels include serving …

– Nutrition labels include amounts of total …

Page 6

Word Prediction Examples

● It happened a long time ago.

● She wants to make a collect phone call.

● I need to open a bank account.

● Nutrition labels include serving sizes.

● Nutrition labels include amounts of total fat|carbohydrate.

Page 7

Augmentative Communication

● Many people with physical disabilities experience problems communicating with other people: many of them cannot speak or type

● Word prediction models can productively augment their communication efforts by automatically suggesting the next word to speak or type

● For example, people with disabilities can use simple hand movements to choose the next word to speak or type

Page 8

Real-Word Spelling Errors

● Real-word spelling errors are real words incorrectly used

● Examples:

– They are leaving in about fifteen minuets to go to her house.

– The study was conducted mainly be John Black.

– The design an construction of the system will take more than a year.

– Hopefully, all with continue smoothly in my absence.

– I need to notified the bank of this problem.

– He is trying to fine out.

K. Kukich, "Techniques for Automatically Correcting Words in Text." ACM Computing Surveys, Vol. 24, No. 4, Dec. 1992.

Page 9

Word Sequence Probabilities

● Word prediction is based on evaluating probabilities of specific word sequences

● To estimate those probabilities we need a corpus (a speech corpus or a text corpus)

● We also need to determine what is counted and how: the most important decision is how to handle punctuation marks and capitalization (text) or filled pauses like uh and um (speech)

● What is counted and how depends on the task at hand (e.g., punctuation is more important for grammar checking than for spelling correction)

Page 10

Wordforms, Lemmas, Types, Tokens

● A wordform is an alphanumerical sequence actually used in the corpus (e.g., begin, began, begun)

● A lemma is a set of wordforms (e.g., {begin, began, begun})

● A token is a synonym of wordform

● A type is a dictionary entry: for example, a dictionary lists only begin as the main entry for the lemma {begin, began, begun}
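
A minimal sketch (not from the slides) of the token/type distinction, assuming whitespace tokenization and lowercasing as the only normalization:

```python
# Hypothetical illustration of tokens vs. types; the text and normalization are assumptions.
text = "They began to begin what they had begun"
tokens = text.lower().split()   # every wordform occurrence is a token
types = set(tokens)             # distinct wordforms are types

print(len(tokens))  # 8 tokens
print(len(types))   # 7 types ('they' occurs twice)
# Note: began, begin, begun are three distinct wordforms belonging to one lemma.
```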

Page 11

Unsmoothed N-Grams

Page 12

Notation: A Sequence of N Words

$w_1^n = w_1 w_2 \ldots w_n$ denotes a sequence of n words

● Example: ‘I understand this algorithm.’

– W1 = ‘I’

– W2 = ‘understand’

– W3 = ‘this’

– W4 = ‘algorithm’

– W5 = ‘.’

Page 13

Probabilities of Word Sequences

$$P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$$

Example:

P(‘I understand this algorithm.’) =

P(‘I’) *

P(‘understand’|‘I’) *

P(‘this’|‘I understand’) *

P(‘algorithm’|‘I understand this’) *

P(‘.’|‘I understand this algorithm’)

Page 14

Probabilities of Word Sequences

● How difficult is it to compute the required probabilities?

– P(‘I’) – this is easy to compute (e.g., the frequency of ‘I’ in the corpus over the size of the corpus)

– P(‘understand’|‘I’) – harder but quite feasible

– P(‘this’|‘I understand’) – harder still but feasible

– P(‘algorithm’|‘I understand this’) – even harder (Why?)

– P(‘.’|‘I understand this algorithm’) – possible but impractical

Page 15

Probability Approximation

● Markov assumption: we can estimate the probability of a word given only the N previous words

● If N = 0, we have the unigram model (aka 0th-order Markov model)

● If N = 1, we have the bigram model (aka 1st-order Markov model)

● If N = 2, we have the trigram model (aka 2nd-order Markov model)

● N can be greater, but higher values are rare because they are hard to compute

Page 16

Bigram Probability Approximation

● <S> is the start of sentence mark; this is a dummy mark

● What is the probability of ‘I understand this algorithm.’?

● P(‘I understand this algorithm.’) =

P(‘I’|<S>) *

P(‘understand’|‘I’) *

P(‘this’|‘understand’) *

P(‘algorithm’|‘this’) *

P(‘.’ |‘algorithm’)

Page 17

Trigram Probability Approximation

● <S> is the start of sentence mark

● In the trigram model, we assume that at the beginning of the sentence there are two start marks <S><S>

● P(‘I understand this algorithm.’) =

P(‘I’|<S><S>) *

P(‘understand’|‘<S>I’) *

P(‘this’|‘I understand’) *

P(‘algorithm’|‘understand this’) *

P(‘.’ |‘this algorithm’)

Page 18

N-Gram Approximation

General formula: $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1}),\ N \ge 1$

$N = 2:\ P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1}) = P(w_n \mid w_{n-1}^{n-1}) = P(w_n \mid w_{n-1})$

$N = 3:\ P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1}) = P(w_n \mid w_{n-2}^{n-1}) = P(w_n \mid w_{n-2}\, w_{n-1})$

$N = 4:\ P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1}) = P(w_n \mid w_{n-3}^{n-1}) = P(w_n \mid w_{n-3}\, w_{n-2}\, w_{n-1})$

Page 19

Bigram Approximation

$$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$$

Page 20

Bigram Approximation Example

Bigram            Probability
<S> I             0.25
I understand      0.3
understand this   0.05
this algorithm    0.7
algorithm .       0.45

P(‘I understand this algorithm.’) =

P(‘I’|<S>) * P(‘understand’|‘I’) * P(‘this’|‘understand’) * P(‘algorithm’|‘this’) * P(‘.’ |‘algorithm’) =

0.25 * 0.3 * 0.05 * 0.7 * 0.45 =

0.00118125
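
A minimal Python sketch of this computation (not part of the original slides), assuming the five bigram probabilities above are stored in a dictionary keyed by word pairs:

```python
# Bigram probabilities taken from the slide's example (illustrative values, not corpus estimates).
bigram_prob = {
    ("<S>", "I"): 0.25,
    ("I", "understand"): 0.3,
    ("understand", "this"): 0.05,
    ("this", "algorithm"): 0.7,
    ("algorithm", "."): 0.45,
}

def sentence_probability(tokens, bigram_prob):
    """Approximate P(w_1 ... w_n) as the product of P(w_k | w_{k-1}), with <S> prepended."""
    prob = 1.0
    prev = "<S>"
    for w in tokens:
        prob *= bigram_prob.get((prev, w), 0.0)  # unseen bigram -> 0 (unsmoothed model)
        prev = w
    return prob

print(sentence_probability(["I", "understand", "this", "algorithm", "."], bigram_prob))
# ~0.00118125 (up to floating-point rounding)
```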

Page 21

Logprobs

● If we compute raw probability products, we risk the problem of numerical underflow: on long word sequences the products become too small to represent and eventually evaluate to zero

● To address this problem, the probabilities are computed in logarithmic space: instead of computing the product of probabilities, we compute the sum of the logarithms of those probabilities

● log(P(A)P(B)) = log(P(A)) + log(P(B))

● The original product can be recovered by exponentiation: P(A)P(B) = exp(log(P(A)) + log(P(B)))
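
A small sketch of the same computation done in log space, reusing the hypothetical bigram_prob dictionary from the earlier sketch:

```python
import math

# Same illustrative bigram probabilities as in the earlier sketch.
bigram_prob = {
    ("<S>", "I"): 0.25,
    ("I", "understand"): 0.3,
    ("understand", "this"): 0.05,
    ("this", "algorithm"): 0.7,
    ("algorithm", "."): 0.45,
}

tokens = ["I", "understand", "this", "algorithm", "."]
logprob = 0.0
prev = "<S>"
for w in tokens:
    logprob += math.log(bigram_prob[(prev, w)])  # sum of logs instead of a product
    prev = w

print(logprob)            # about -6.74
print(math.exp(logprob))  # exponentiation recovers ~0.00118125
```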

Page 22

Bigram Computation

$C(w_{n-1} w_n)$ is the count of the bigram $w_{n-1} w_n$ in the corpus; $V$ is the dictionary size

$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{i=1}^{V} C(w_{n-1} w_i)} = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$$

Page 23

Example

Suppose our vocabulary is {S, a, b, c, E}, where S and E are the start and end marks, so that $w_1 = S$, $w_2 = a$, $w_3 = b$, $w_4 = c$, $w_5 = E$. Suppose our text is ".aabcca." We can represent the text as the sequence of characters S, a, a, b, c, c, a, E.

$C(a) = \sum_i C(a\, w_i) = 3$: $C(aS) = 0$, $C(aa) = 1$, $C(ab) = 1$, $C(ac) = 0$, $C(aE) = 1$

$C(b) = \sum_i C(b\, w_i) = 1$: $C(bS) = 0$, $C(ba) = 0$, $C(bb) = 0$, $C(bc) = 1$, $C(bE) = 0$

$C(c) = \sum_i C(c\, w_i) = 2$: $C(cS) = 0$, $C(ca) = 1$, $C(cb) = 0$, $C(cc) = 1$, $C(cE) = 0$
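
A short sketch (mine, not the slide's) that reproduces these counts and the resulting MLE bigram probabilities from the character sequence:

```python
from collections import Counter

# The slide's toy text ".aabcca." as a symbol sequence with start/end marks S and E.
seq = ["S", "a", "a", "b", "c", "c", "a", "E"]

bigram_counts = Counter(zip(seq, seq[1:]))   # C(w_{n-1} w_n)
unigram_counts = Counter(seq[:-1])           # C(w_{n-1}), counted as a bigram start

def p_mle(w_prev, w):
    """Unsmoothed MLE estimate P(w | w_prev) = C(w_prev w) / C(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(bigram_counts[("a", "a")], unigram_counts["a"])  # 1 3
print(p_mle("a", "a"))                                 # 1/3
print(p_mle("c", "a"))                                 # 1/2
```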

Page 24

N-Gram Generalization

General formula: $P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})},\ N \ge 1$

Examples:

For $N = 2$: $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\, w_n)}{C(w_{n-1})}$

For $N = 3$: $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-2}\, w_{n-1}) = \frac{C(w_{n-2}\, w_{n-1}\, w_n)}{C(w_{n-2}\, w_{n-1})}$
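
A generalized sketch of the same estimate for an arbitrary N, under the assumption that the sequence is padded with start marks; the helper name ngram_mle is mine:

```python
from collections import Counter

def ngram_mle(tokens, n):
    """Build an unsmoothed MLE estimator P(w | history) = C(history w) / C(history),
    where the history has length n-1 and the sequence is padded with <S> marks."""
    padded = ["<S>"] * (n - 1) + list(tokens)
    positions = range(len(padded) - n + 1)
    ngrams = Counter(tuple(padded[i:i + n]) for i in positions)
    histories = Counter(tuple(padded[i:i + n - 1]) for i in positions)

    def prob(word, history):
        history = tuple(history)[-(n - 1):] if n > 1 else ()
        if histories[history] == 0:
            return 0.0                         # unseen history (unsmoothed model)
        return ngrams[history + (word,)] / histories[history]

    return prob

p = ngram_mle("the cat sat on the mat".split(), n=2)
print(p("cat", ["the"]))   # 0.5: 'the' is followed by 'cat' once and by 'mat' once
```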

Page 25

Maximum Likelihood Estimation

● This way of estimating N-Gram probabilities is known as Maximum Likelihood Estimation (MLE)

● It is called MLE because it is the estimate that maximizes the probability (likelihood) of the training set

● Example: if a word W occurs 5 times in a training corpus of 100 words, its estimated probability of occurrence is P(W) = 5/100

● This is not necessarily a good estimate of P(W) for other corpora, but it is the one that maximizes the likelihood of the training corpus

Page 26

N-Gram Smoothing

Page 27

Unsmoothed N-Gram Problem

● Since any corpus is finite, some valid N-Grams will not be found in the corpus used to compute the N-Gram counts

● To put it differently, an N-Gram matrix for any corpus is likely to be sparse: it will have a large number of possible N-Grams with zero counts

● MLE also produces unreliable estimates when counts are greater than 0 but still small (small, of course, is relative)

● Smoothing is a set of techniques used to overcome zero or low counts

Page 28

Add-One Smoothing

One way to smooth is to add one to all N-Gram counts and add the dictionary size V to the normalizing denominator:

unsmoothed: $P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$

add-one smoothed: $P^{*}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$
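
A minimal sketch of the add-one (Laplace) estimate, reusing the toy character example from the earlier slide; variable names are mine:

```python
from collections import Counter

seq = ["S", "a", "a", "b", "c", "c", "a", "E"]   # the toy text ".aabcca."
vocab = ["S", "a", "b", "c", "E"]
V = len(vocab)                                   # dictionary size

bigram_counts = Counter(zip(seq, seq[1:]))
unigram_counts = Counter(seq[:-1])

def p_add_one(w_prev, w):
    """Add-one smoothed estimate P*(w | w_prev) = (C(w_prev w) + 1) / (C(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

print(p_add_one("a", "a"))   # (1 + 1) / (3 + 5) = 0.25
print(p_add_one("a", "c"))   # (0 + 1) / (3 + 5) = 0.125: an unseen bigram is no longer zero
```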

Page 29

A Problem with Add-One Smoothing

● Much of the total probability mass moves to the N-Grams with zero counts

● Researchers attribute it to the arbitrary choice of the value of 1

● Add-One smoothing appears to be worse than other methods at predicting N-Grams with zero counts

● Some research indicates that add-one smoothing is no better than no smoothing

Page 30

Good-Turing Discounting

● Probability mass assigned to N-Grams with zero or low counts is reassigned using the counts of N-Grams with higher counts

● Let Nc be the number of N-Grams that occur c times in a corpus

● N0 is the number of N-Grams that occur 0 times

● N1 is the number of N-Grams that occur once

● N2 is the number of N-Grams that occur twice

Page 31

Good-Turing Discounting

Let $C(w_1 \ldots w_n) = c$ be the count of some N-Gram $w_1 \ldots w_n$; then the new count smoothed by Good-Turing discounting, $C^{*}(w_1 \ldots w_n)$, is:

$$C^{*}(w_1 \ldots w_n) = (c + 1)\,\frac{N_{c+1}}{N_c}$$
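
A small sketch of the adjusted count $c^{*} = (c + 1) N_{c+1} / N_c$; the bigram counts here are made-up numbers used only to produce the counts-of-counts $N_c$:

```python
from collections import Counter

# Hypothetical N-Gram counts from some corpus (made-up numbers for illustration).
bigram_counts = {"ab": 3, "cd": 1, "ef": 1, "gh": 1, "ij": 2, "kl": 2, "mn": 1}

# N_c = number of N-Grams that occur exactly c times ("counts of counts").
counts_of_counts = Counter(bigram_counts.values())   # {1: 4, 2: 2, 3: 1}

def good_turing_count(c):
    """Adjusted count c* = (c + 1) * N_{c+1} / N_c (undefined when N_c is 0)."""
    n_c, n_c_plus_1 = counts_of_counts[c], counts_of_counts[c + 1]
    if n_c == 0:
        raise ValueError("N_c is zero; the simple Good-Turing formula does not apply")
    return (c + 1) * n_c_plus_1 / n_c

print(good_turing_count(1))   # 2 * N_2 / N_1 = 2 * 2 / 4 = 1.0
print(good_turing_count(2))   # 3 * N_3 / N_2 = 3 * 1 / 2 = 1.5
```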

Page 32

N-Gram Vectors

● N-Grams can be computed over any finite symbolic set

● Such symbolic sets are called alphabets and can consist of wordforms, waveforms, individual letters, etc.

● The choice of the symbols in the alphabet depends on the application

● Regardless of the application, the objective is to take an input sequence over a specific alphabet and compute its N-Gram frequency vector

Page 33

Dimensions of N-Gram Vectors

● Let A be an alphabet and n > 0 be the size of the N-Gram

● The number of N-Gram dimensions is |A|^n

● Suppose that the alphabet has 26 characters and we compute trigrams over that alphabet; then the number of possible trigrams, i.e., the dimension of the N-Gram frequency vectors, is 26^3 = 17576

● A practical implication is that N-Gram frequency vectors are sparse even for low values of n

Page 34

Example

● Suppose the alphabet A = {a, <space>, <start>}

● The number of possible bigrams (n=2) is |A|^2 = 9:

– 1) aa; 2) a<start>; 3) a<space>; 4) <start><start>; 5) <start>a; 6) <start><space>; 7) <space><space>; 8) <space>a; 9) <space><start>

● Suppose the input is ‘a a’

● The input’s bigrams are: <start>a, a<space>, <space>a

● So the input’s N-Gram vector is (0, 0, 1, 0, 1, 0, 0, 1, 0) (this assumes 1-based indexing into the list above)
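
A sketch that builds this 9-dimensional bigram frequency vector; the explicit bigram ordering below follows the slide's enumeration, and the function name is mine:

```python
# The slide's bigram ordering (positions 1..9 in its 1-based listing).
all_bigrams = [
    ("a", "a"), ("a", "<start>"), ("a", "<space>"),
    ("<start>", "<start>"), ("<start>", "a"), ("<start>", "<space>"),
    ("<space>", "<space>"), ("<space>", "a"), ("<space>", "<start>"),
]

def bigram_vector(symbols):
    """Count each possible bigram of the alphabet in the input symbol sequence."""
    seen = list(zip(symbols, symbols[1:]))
    return [seen.count(bg) for bg in all_bigrams]

# The input 'a a' with the start mark prepended and the space made explicit.
symbols = ["<start>", "a", "<space>", "a"]
print(bigram_vector(symbols))   # [0, 0, 1, 0, 1, 0, 0, 1, 0]
```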

Page 35

Markov & Hidden Markov Models

Page 36

Markov Models

$$P(w_n \mid w_1 \ldots w_{n-1})$$

Markov Models are closely related to N-Grams: the basic idea is to estimate the conditional probability of the n-th observation given the sequence of the previous n-1 observations

Page 37

Markov Assumption

1st order: $P(w_n \mid w_1 \ldots w_{n-1}) \approx P(w_n \mid w_{n-1})$

2nd order: $P(w_n \mid w_1 \ldots w_{n-1}) \approx P(w_n \mid w_{n-2}\, w_{n-1})$

3rd order: $P(w_n \mid w_1 \ldots w_{n-1}) \approx P(w_n \mid w_{n-3}\, w_{n-2}\, w_{n-1})$

● If n = 5 and the size of the observation alphabet is 3, we need to collect statistics over 3^5 = 243 sequence types

● If n = 2 and the size of the observation alphabet is 3, we need to collect statistics over 3^2 = 9 sequence types

● So the number of observations in the conditioning history matters

Page 38

Weather Example 01

Today \ Tomorrow   Sunny   Rainy   Foggy
Sunny              0.8     0.05    0.15
Rainy              0.2     0.6     0.2
Foggy              0.2     0.3     0.5

Weather Today vs. Weather Tomorrow

$P(w_2 = \text{Sunny}, w_3 = \text{Rainy} \mid w_1 = \text{Sunny})$
$= P(w_3 = \text{Rainy} \mid w_2 = \text{Sunny}, w_1 = \text{Sunny})\, P(w_2 = \text{Sunny} \mid w_1 = \text{Sunny})$
$= P(w_3 = \text{Rainy} \mid w_2 = \text{Sunny})\, P(w_2 = \text{Sunny} \mid w_1 = \text{Sunny})$
$= 0.05 \times 0.8 = 0.04$

Here is how to read this table:

1st row: P(Sunny|Sunny)=0.8; P(Rainy|Sunny)=0.05; P(Foggy|Sunny)=0.15

2nd row: P(Sunny|Rainy)=0.2; P(Rainy|Rainy)=0.6; P(Foggy|Rainy)=0.2

3rd row: P(Sunny|Foggy)=0.2; P(Rainy|Foggy)=0.3; P(Foggy|Foggy)=0.5

Page 39

Weather Example 02

Today \ Tomorrow   Sunny   Rainy   Foggy
Sunny              0.8     0.05    0.15
Rainy              0.2     0.6     0.2
Foggy              0.2     0.3     0.5

Weather Today vs. Weather Tomorrow

$P(w_3 = \text{Rainy} \mid w_1 = \text{Foggy})$
$= P(w_2 = \text{Sunny}, w_3 = \text{Rainy} \mid w_1 = \text{Foggy}) + P(w_2 = \text{Rainy}, w_3 = \text{Rainy} \mid w_1 = \text{Foggy}) + P(w_2 = \text{Foggy}, w_3 = \text{Rainy} \mid w_1 = \text{Foggy})$
$= P(w_3 = \text{Rainy} \mid w_2 = \text{Sunny})\, P(w_2 = \text{Sunny} \mid w_1 = \text{Foggy}) + P(w_3 = \text{Rainy} \mid w_2 = \text{Rainy})\, P(w_2 = \text{Rainy} \mid w_1 = \text{Foggy}) + P(w_3 = \text{Rainy} \mid w_2 = \text{Foggy})\, P(w_2 = \text{Foggy} \mid w_1 = \text{Foggy})$
$= 0.05 \times 0.2 + 0.6 \times 0.3 + 0.3 \times 0.5 = 0.34$
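
A small sketch of both weather computations, with the transition table stored as a nested dictionary (the data structure is mine; the probabilities are the slide's):

```python
# Transition probabilities P(tomorrow | today) from the table above.
trans = {
    "Sunny": {"Sunny": 0.8, "Rainy": 0.05, "Foggy": 0.15},
    "Rainy": {"Sunny": 0.2, "Rainy": 0.6, "Foggy": 0.2},
    "Foggy": {"Sunny": 0.2, "Rainy": 0.3, "Foggy": 0.5},
}

# Example 01: P(w2 = Sunny, w3 = Rainy | w1 = Sunny)
p1 = trans["Sunny"]["Sunny"] * trans["Sunny"]["Rainy"]
print(p1)   # 0.8 * 0.05 = 0.04 (up to floating-point rounding)

# Example 02: P(w3 = Rainy | w1 = Foggy), marginalizing over tomorrow's weather w2.
p2 = sum(trans["Foggy"][w2] * trans[w2]["Rainy"] for w2 in trans)
print(p2)   # 0.2*0.05 + 0.3*0.6 + 0.5*0.3 = 0.34
```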

Page 40

Speech Recognition

w is a sequence of tokens

L is a language

y is an acoustic signal

$$\hat{w} = \operatorname*{argmax}_{w \in L} P(w \mid y) = \operatorname*{argmax}_{w \in L} \frac{P(y \mid w)\, P(w)}{P(y)} = \operatorname*{argmax}_{w \in L} P(y \mid w)\, P(w)$$
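
A tiny sketch of this noisy-channel decision rule: score each candidate word sequence w by P(y|w) P(w) and take the argmax. The candidates and their scores are made-up placeholders, not the output of a real acoustic or language model:

```python
# Hypothetical candidates with made-up acoustic likelihoods P(y|w) and LM priors P(w).
candidates = {
    "I understand this algorithm .": (0.020, 0.0012),
    "I under stand this algorithm .": (0.018, 0.0001),
    "eye understand this algorithm .": (0.021, 0.00005),
}

# w_hat = argmax_w P(y|w) * P(w); P(y) is constant over w and can be dropped.
best = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
print(best)   # "I understand this algorithm ."
```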

Page 41

References

● D. Jurafsky & J. Martin. Speech and Language Processing, Ch. 6. Prentice Hall, ISBN 0-13-095069-6.

● E. Fossler-Lussier. 1998. Markov Models and Hidden Markov Models: A Brief Tutorial. ICSI, UC Berkeley.