
Introduction to Language Modeling (Parsing and Machine Translation Reading Group)

Christian Scheible

October 23, 2008


Outline

1. Introduction

2. Smoothing Methods


Language Models

What?

Probability distribution for word sequences:

$p(w_1, \ldots, w_n) = p(w_1^n)$

⇒ How likely is a sentence to be produced?

Why?

• Machine translation

• Speech recognition

• ...


N-gram Models

• Conditional probabilities for words given their N − 1 predecessors

• Independence assumption!

→ Approximation

Example

p(My dog finds a bone) ≈ p(My) p(dog | My) p(finds | My, dog) p(a | dog, finds) p(bone | finds, a)
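A minimal sketch of how this factorization might be computed from corpus counts, here with a trigram-style history of two words; the toy corpus, the count dictionaries, the helper names, and the start-of-sentence padding are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

# Toy corpus; in practice the counts come from a large training corpus.
corpus = [["my", "dog", "finds", "a", "bone"],
          ["my", "dog", "finds", "a", "ball"]]

trigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)

for sentence in corpus:
    padded = ["<s>", "<s>"] + sentence
    for i in range(2, len(padded)):
        trigram_counts[tuple(padded[i - 2:i + 1])] += 1
        bigram_counts[tuple(padded[i - 2:i])] += 1

def p_mle(word, h1, h2):
    """Maximum-likelihood estimate p(word | h1, h2) from raw counts."""
    denom = bigram_counts[(h1, h2)]
    return trigram_counts[(h1, h2, word)] / denom if denom else 0.0

def sentence_prob(sentence):
    """Chain-rule product of the conditional trigram probabilities."""
    padded = ["<s>", "<s>"] + sentence
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= p_mle(padded[i], padded[i - 2], padded[i - 1])
    return prob

print(sentence_prob(["my", "dog", "finds", "a", "bone"]))  # 0.5 on the toy corpus
```

On real data this unsmoothed product quickly hits zero probabilities, which is exactly the motivation for the smoothing methods that follow.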


Why Smoothing?

Smoothing: Adjusting maximum likelihood estimates to produce more accurate probabilities


Why Smoothing?

• Probabilities: relative frequencies (from corpus)

$p(w_1, \ldots, w_n) = \frac{f(w_1, \ldots, w_n)}{N}$

• But not all combinations of $w_1, \ldots, w_n$ are covered

→ Data sparseness leads to probability 0 for many words

Example

p(My dog found two bones)
p(My dog found three bones)
p(My dog found seventy bones)
p(My dog found eighty sausages)
p(My dog found ... ...)


Smoothing Models – Overview

• Laplace Smoothing

• Additive Smoothing

• Good-Turing Estimate

• Katz Backoff

• Interpolation (Jelinek-Mercer)

• Absolute Discounting

• Witten-Bell Smoothing

• Kneser-Ney Smoothing


Laplace Smoothing (1)

• Add 1 to the numerator

• Add the size of the vocabulary (|V|) to the denominator

$p(w_1, \ldots, w_n) = \frac{f(w_1, \ldots, w_n) + 1}{N + |V|}$
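A minimal sketch of the add-one estimate above; the toy counts and the vocabulary size are assumptions.

```python
from collections import Counter

def laplace_prob(ngram, counts, total, vocab_size):
    """Add-one (Laplace) estimate: (f(ngram) + 1) / (N + |V|)."""
    return (counts[ngram] + 1) / (total + vocab_size)

# Illustrative usage with toy counts (names and data are assumptions):
counts = Counter({("my", "dog"): 3, ("dog", "finds"): 2})
total = sum(counts.values())   # N
vocab_size = 10_000            # |V|: number of distinct event types
print(laplace_prob(("my", "dog"), counts, total, vocab_size))
print(laplace_prob(("my", "cat"), counts, total, vocab_size))  # unseen, but still > 0
```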


Laplace Smoothing (2)

Problems

• Intended for uniform distributions, but language data usually follows a Zipf distribution

• Overestimates new items, underestimates known items

• Unrealistically alters probabilities of items with high frequencies

MLE   Empirical   Laplace
0     0.000027    0.000137
1     0.448       0.000274
2     1.25        0.000411
3     2.24        0.000548
4     3.23        0.000685
5     4.21        0.000822
6     5.23        0.000959
7     6.21        0.00109
8     7.21        0.00123
9     8.26        0.00137


Additive Smoothing

• Add λ to the numerator

• Add λ× |V | to the denominator

$p(w_1, \ldots, w_n) = \frac{f(w_1, \ldots, w_n) + \lambda}{N + \lambda |V|}$

Problems

• λ needs to be determined

• Same issues as Laplace smoothing
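A minimal sketch of the add-λ estimate together with one simple way λ might be determined, a grid search on held-out data; the toy data, candidate values, and helper names are assumptions.

```python
import math
from collections import Counter

def add_lambda_prob(ngram, counts, total, vocab_size, lam):
    """Additive estimate: (f(ngram) + λ) / (N + λ|V|)."""
    return (counts[ngram] + lam) / (total + lam * vocab_size)

def heldout_log_likelihood(heldout, counts, total, vocab_size, lam):
    """Score a candidate λ on held-out n-grams (higher is better)."""
    return sum(math.log(add_lambda_prob(ng, counts, total, vocab_size, lam))
               for ng in heldout)

# Pick λ by grid search on held-out data (toy values are assumptions).
counts = Counter({("my", "dog"): 3, ("dog", "finds"): 2})
heldout = [("my", "dog"), ("my", "cat")]
total, vocab_size = sum(counts.values()), 10_000
best = max([0.001, 0.01, 0.1, 0.5, 1.0],
           key=lambda lam: heldout_log_likelihood(heldout, counts, total, vocab_size, lam))
print("best lambda:", best)
```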


Good-Turing Estimate (1)

• Idea: Reallocate the probability mass of events that occur n + 1 times to the events that occur n times

• For each frequency f , produce an adjusted (expected) frequency f ∗

$f^* \approx (f + 1)\,\frac{E(n_{f+1})}{E(n_f)} \approx (f + 1)\,\frac{n_{f+1}}{n_f}$

$p_{GT}(w_1, \ldots, w_n) = \frac{f^*(w_1, \ldots, w_n)}{\sum_X f^*(X)}$
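A minimal sketch of the Good-Turing adjustment computed from frequency-of-frequency counts; the toy data and the fallback used when $n_{f+1}$ is zero are assumptions (a real implementation would smooth the $n_f$ values first, as the next slide notes).

```python
from collections import Counter

def good_turing_adjusted_counts(counts):
    """Compute adjusted counts f* = (f + 1) * n_{f+1} / n_f.

    counts maps events to raw frequencies; n_f is the number of events
    seen exactly f times. Returns {event: f*}. When n_{f+1} is 0 this
    sketch simply falls back to the raw count (an assumption)."""
    freq_of_freq = Counter(counts.values())  # n_f
    adjusted = {}
    for event, f in counts.items():
        n_f, n_f1 = freq_of_freq[f], freq_of_freq.get(f + 1, 0)
        adjusted[event] = (f + 1) * n_f1 / n_f if n_f1 else f
    return adjusted

# Toy usage (data is an assumption): letter frequencies stand in for n-gram counts.
counts = Counter("abracadabra")
print(good_turing_adjusted_counts(counts))
```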


Good-Turing Estimate (2)

MLE   Test Set   Laplace    Good-Turing
0     0.000027   0.000137   0.000027
1     0.448      0.000274   0.446
2     1.25       0.000411   1.26
3     2.24       0.000548   2.24
4     3.23       0.000685   3.24
5     4.21       0.000822   4.22
6     5.23       0.000959   5.19
7     6.21       0.00109    6.21
8     7.21       0.00123    7.24
9     8.26       0.00137    8.25

Problems

• What if $n_f$ is 0?

→ The $n_f$ counts need to be smoothed

• Provides no way to combine with other (e.g. lower-order) distributions


Katz Backoff (1)

• Combines probability distributions

• Idea: Higher- and lower-order distributions

• Recursive smoothing up to unigrams

$p(w_i \mid w_k, \ldots, w_{i-1}) =
\begin{cases}
\delta\,\dfrac{f(w_k, \ldots, w_i)}{f(w_k, \ldots, w_{i-1})} & \text{if } f(w_k, \ldots, w_i) > 0 \\[1ex]
\alpha\, p(w_i \mid w_{k+1}, \ldots, w_{i-1}) & \text{otherwise}
\end{cases}$

α ensures that the sum of all probabilities equals 1
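A minimal sketch of backoff in the bigram → unigram case; the toy counts and the fixed discount are assumptions (Katz derives the discount from the Good-Turing estimate), but the backoff weight α is computed so the distribution sums to 1.

```python
from collections import Counter

# Toy counts (assumptions); in practice these come from a training corpus.
unigrams = Counter({"bones": 4, "two": 3, "some": 2, "dog": 1})
bigrams  = Counter({("two", "bones"): 2, ("dog", "bones"): 1})
N = sum(unigrams.values())
DISCOUNT = 0.5   # simplified fixed discount standing in for the Good-Turing-based one

def p_unigram(w):
    return unigrams[w] / N

def backoff_weight(h):
    """α(h): leftover mass of history h, renormalized over words unseen after h."""
    seen = [w for (a, w) in bigrams if a == h]
    if not seen:
        return 1.0
    left_over = 1.0 - sum((bigrams[(h, w)] - DISCOUNT) / unigrams[h] for w in seen)
    return left_over / sum(p_unigram(w) for w in unigrams if w not in seen)

def p_katz(w, h):
    """Discounted bigram estimate if seen, otherwise back off to the unigram model."""
    if bigrams[(h, w)] > 0:
        return (bigrams[(h, w)] - DISCOUNT) / unigrams[h]
    return backoff_weight(h) * p_unigram(w)

print(p_katz("bones", "two"))   # seen bigram
print(p_katz("bones", "some"))  # unseen: backs off to the unigram estimate
```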


Katz Backoff (2)

Example

Seen: p(two bones)
Unseen: p(some bones)

Known: p(bones | two)
Unknown: p(bones | some)

Known: p(bones)!

The higher-order frequency is related to the lower-order frequency.
Important: p(bones | some) < p(bones | two)


Interpolation (Jelinek-Mercer)

• Another method to combine probability distributions

• Weighted average

$p(w_i \mid w_k, \ldots, w_{i-1}) = \sum_{j=0}^{k} \lambda_j\, p(w_i \mid w_{k+j}, \ldots, w_{i-1})$

About λj

• The sum of all λj is 1

• Use held-out data to determine the best set of values
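A minimal sketch of interpolating trigram, bigram, and unigram estimates with fixed weights; the component models and the λ values are assumptions and would normally be tuned on held-out data as noted above.

```python
# Jelinek-Mercer style interpolation for a trigram model (illustrative sketch).

def p_interpolated(w, h1, h2, p_uni, p_bi, p_tri, lambdas=(0.5, 0.3, 0.2)):
    """Weighted average of trigram, bigram and unigram estimates.

    lambdas must sum to 1; in practice they are determined on held-out
    data (e.g. with EM or a simple grid search)."""
    l3, l2, l1 = lambdas
    return l3 * p_tri(w, h1, h2) + l2 * p_bi(w, h2) + l1 * p_uni(w)

# Toy usage with hard-coded component models (assumptions):
p_uni = lambda w: {"bones": 0.1, "ball": 0.05}.get(w, 0.01)
p_bi  = lambda w, h2: 0.4 if (h2, w) == ("two", "bones") else 0.0
p_tri = lambda w, h1, h2: 0.6 if (h1, h2, w) == ("found", "two", "bones") else 0.0
print(p_interpolated("bones", "found", "two", p_uni, p_bi, p_tri))
```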


Absolute Discounting (1)

• Subtract a fixed number from all frequencies

• Uniformly assign the sum of these numbers to new events

$p(w_1, \ldots, w_n) =
\begin{cases}
\dfrac{f(w_1, \ldots, w_n) - \delta}{N} & \text{if } f(w_1, \ldots, w_n) - \delta > 0 \\[1ex]
\dfrac{(A - n_0)\,\delta}{n_0\,N} & \text{otherwise}
\end{cases}$

$n_n$: the number of events that occurred $n$ times (so $n_0$ counts the unseen events); $A$: the total number of possible events, so that $A - n_0$ is the number of seen events whose discounts are redistributed.
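A minimal sketch of absolute discounting over a single distribution; the toy counts, the value of δ, and the total number of possible events A are assumptions.

```python
from collections import Counter

def absolute_discount_probs(counts, num_possible_events, delta=0.5):
    """Absolute discounting over one distribution.

    counts: frequencies of seen events; num_possible_events: A, the total
    number of possible events (an assumption of this sketch); delta: the
    fixed discount. Returns ({seen event: prob}, prob of each unseen event)."""
    N = sum(counts.values())
    n0 = num_possible_events - len(counts)             # number of unseen events
    seen = {e: (f - delta) / N for e, f in counts.items()}
    unseen = (len(counts) * delta) / (n0 * N) if n0 else 0.0  # (A - n0) δ / (n0 N)
    return seen, unseen

# Toy usage (data and A are assumptions):
counts = Counter({"bones": 4, "ball": 3, "stick": 1})
seen, unseen = absolute_discount_probs(counts, num_possible_events=10, delta=0.5)
print(seen, unseen)
print(sum(seen.values()) + unseen * (10 - len(counts)))  # sums to 1.0
```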


Absolute Discounting (2)

How to determine δ?

$\delta = \frac{n_1}{n_1 + 2\,n_2}$

Problems

• Still gives wrong results for events with frequency 1

→ Solution: different discounts for frequencies 1, 2, and 3+


Witten-Bell Smoothing (1)

• Instance of Interpolation

• λk : Probability of using the higher-order model

$p_{WB}(w_i \mid w_k, \ldots, w_{i-1}) = \lambda_k\, p_{ML}(w_i \mid w_k, \ldots, w_{i-1}) + (1 - \lambda_k)\, p_{WB}(w_i \mid w_{k+1}, \ldots, w_{i-1})$


Witten-Bell Smoothing (2)

• Thus: 1− λk : Probability of using the lower-order model

• Idea: This is related to the number of different words that follow the history $w_k, \ldots, w_{i-1}$

$N_{1+}(w_k, \ldots, w_{i-1}, \bullet)$ = number of different words that follow $w_k, \ldots, w_{i-1}$

$1 - \lambda_k = \dfrac{N_{1+}(w_k, \ldots, w_{i-1}, \bullet)}{N_{1+}(w_k, \ldots, w_{i-1}, \bullet) + \sum_{w_i} f(w_k, \ldots, w_i)}$
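A minimal sketch of computing λ for a one-word history from bigram counts; the toy counts are assumptions.

```python
from collections import Counter

# Toy bigram counts (an assumption); keys are (history_word, next_word).
bigrams = Counter({("two", "bones"): 2, ("two", "balls"): 1, ("two", "sticks"): 1})

def witten_bell_lambda(history):
    """Return λ for the given one-word history.

    1 - λ = N1+(history, •) / (N1+(history, •) + total count of the history),
    so histories followed by many different words put more weight on the
    lower-order model."""
    followers = {w for (h, w) in bigrams if h == history}
    n1plus = len(followers)
    total = sum(c for (h, _), c in bigrams.items() if h == history)
    one_minus_lambda = n1plus / (n1plus + total)
    return 1.0 - one_minus_lambda

print(witten_bell_lambda("two"))  # 1 - 3/(3+4) = 4/7 on the toy data
```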


Kneser-Ney Smoothing

• Extension of Absolute Discounting: a new way to build the lower-order distribution

Problem Case

"San Francisco":

• Francisco is common but only occurs after San

• Absolute Discounting will yield high p(Francisco|new word)

• Not intended!

• The unigram probability should not be proportional to the word's frequency but to the number of different words it follows
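A minimal sketch of this continuation-count idea: a lower-order estimate proportional to the number of different words a word follows rather than to its raw frequency; the toy bigram counts are assumptions.

```python
from collections import Counter

# Toy bigram counts (an assumption); keys are (previous_word, word).
bigrams = Counter({("san", "francisco"): 40, ("a", "dog"): 3,
                   ("the", "dog"): 5, ("my", "dog"): 2})

def continuation_prob(word):
    """Kneser-Ney style lower-order estimate.

    Proportional to the number of *different* words the word follows,
    not to its raw frequency: 'francisco' is frequent but follows only
    'san', so it gets a small continuation probability."""
    distinct_left = len({h for (h, w) in bigrams if w == word})
    total_bigram_types = len(bigrams)   # number of distinct bigram types
    return distinct_left / total_bigram_types

print(continuation_prob("francisco"))  # 1/4: one left context out of four bigram types
print(continuation_prob("dog"))        # 3/4: three different left contexts
```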


Modified Kneser-Ney Smoothing (1)

Idea

Similarities to Absolute Discounting

• Interpolation with lower-order model

• Discounts

$p_{KN}(w_i \mid w_k, \ldots, w_{i-1}) = \dfrac{f(w_k, \ldots, w_i) - D(f)}{\sum_{w_i} f(w_k, \ldots, w_i)} + \gamma(w_k, \ldots, w_{i-1})\, p_{KN}(w_i \mid w_{k+1}, \ldots, w_{i-1})$


Modified Kneser-Ney Smoothing (2)

Discount function D(f)

• Different discounts for frequencies 1, 2 and 3+

$D(f) =
\begin{cases}
0 & \text{if } f = 0 \\[0.5ex]
D_1 = 1 - 2Y\,\dfrac{n_2}{n_1} & \text{if } f = 1 \\[1ex]
D_2 = 2 - 3Y\,\dfrac{n_3}{n_2} & \text{if } f = 2 \\[1ex]
D_{3+} = 3 - 4Y\,\dfrac{n_4}{n_3} & \text{if } f \ge 3
\end{cases}$

with $Y = \dfrac{n_1}{n_1 + 2\,n_2}$ (the same quantity used for $\delta$ in Absolute Discounting).
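A minimal sketch of computing D1, D2, and D3+ from the frequency-of-frequency counts n1..n4; the toy counts are assumptions chosen so that n1..n4 are all non-zero.

```python
from collections import Counter

def modified_kn_discounts(counts):
    """Compute the modified Kneser-Ney discounts D1, D2, D3+ from the
    frequency-of-frequency statistics n1..n4, with Y = n1 / (n1 + 2 n2).

    counts maps events (n-grams) to raw frequencies."""
    n = Counter(counts.values())   # n[f] = number of events seen exactly f times
    y = n[1] / (n[1] + 2 * n[2])
    d1 = 1 - 2 * y * n[2] / n[1]
    d2 = 2 - 3 * y * n[3] / n[2]
    d3plus = 3 - 4 * y * n[4] / n[3]
    return d1, d2, d3plus

# Toy usage (data is an assumption): n1 = 3, n2 = 2, n3 = 1, n4 = 1.
counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3, "g": 4}
print(modified_kn_discounts(counts))
```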


Modified Kneser-Ney Smoothing (3)

Gamma

Ensures that the distribution sums to 1.

$\gamma(w_k, \ldots, w_{i-1}) = \dfrac{D_1 N_1(w_k, \ldots, w_{i-1}, \bullet) + D_2 N_2(w_k, \ldots, w_{i-1}, \bullet) + D_{3+} N_{3+}(w_k, \ldots, w_{i-1}, \bullet)}{\sum_{w_i} f(w_k, \ldots, w_i)}$

where $N_f(w_k, \ldots, w_{i-1}, \bullet)$ is the number of words that follow the history exactly $f$ times ($3+$: at least three times).


Evaluation

Cross-Entropy Evaluation

• Cross-entropy (H): the average number of bits needed to identify an event from a set of possibilities

• In our case: how good is the approximation our smoothing method produces?

• Calculate the cross-entropy for each method on test data

$H(p, q) = -\sum_x p(x) \log q(x)$
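A minimal sketch of this evaluation with p taken as the empirical distribution of a held-out test set, i.e. the average number of bits per token the smoothed model q assigns; the test data and the two hypothetical models are assumptions.

```python
import math

def cross_entropy(test_ngrams, model_prob):
    """Average negative log2 probability (bits per token) the model assigns
    to a held-out test set; lower is better. model_prob can be any smoothed
    probability function (an assumption of this sketch)."""
    total_bits = -sum(math.log2(model_prob(ng)) for ng in test_ngrams)
    return total_bits / len(test_ngrams)

# Toy usage comparing two hypothetical smoothed models on the same test data:
test = [("my", "dog"), ("dog", "finds"), ("finds", "a"), ("a", "bone")]
model_a = lambda ng: 0.10                               # 0.1 for every test bigram
model_b = lambda ng: 0.20 if ng[1] != "bone" else 0.05
print(cross_entropy(test, model_a))   # ≈ 3.32 bits/token
print(cross_entropy(test, model_b))   # ≈ 2.82 bits/token
```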


Evaluation

[Figures: four plots of the relative performance of smoothing algorithms, measured as the difference in test cross-entropy from the jelinek-mercer baseline (bits/token), against training set size (sentences). Panels: WSJ/NAB corpus (2-gram, 3-gram) and Broadcast News corpus (2-gram, 3-gram). Algorithms shown: katz, kneser-ney, kneser-ney-mod, abs-disc-interp, jelinek-mercer, witten-bell-backoff.]


References

1. Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 1996.

2. Joshua Goodman. The State of the Art in Language Model Smoothing. http://research.microsoft.com/~joshuago/lm-tutorial-public.ppt

3. Bill MacCartney. NLP Lunch Tutorial: Smoothing. http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf

4. Christopher Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

5. Helmut Schmid. Statistische Methoden in der Maschinellen Sprachverarbeitung (Statistical Methods in Natural Language Processing). http://www.ims.uni-stuttgart.de/~schmid/ParsingII/current/snlp-folien.pdf

