
Page 1

Smoothing

LING 570

Fei Xia

Week 5: 10/24/07

Page 2

Smoothing

• What?

• Why?
  – To deal with events observed zero times.
  – "event": a particular n-gram

• How?
  – To shave a little bit of probability mass from the higher counts, and pile it instead on the zero counts.

– For the time being, we assume that there are no unknown words; that is, V is a closed vocabulary.

Page 3

Smoothing methods

• Laplace smoothing (a.k.a. Add-one smoothing)
• Good-Turing smoothing
• Linear interpolation (a.k.a. Jelinek-Mercer smoothing)
• Katz backoff
• Class-based smoothing
• Absolute discounting
• Kneser-Ney smoothing

Page 4

Laplace smoothing

• Add 1 to all frequency counts.

• Let |V| be the vocabulary size.

  P_{MLE}(wi | wi-1) = c(wi-1, wi) / c(wi-1)

  P_{Lap}(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + |V|)

Page 5

Laplace smoothing: n-gram

Bigram:

  P_{Lap}(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + |V|)

n-gram:

  P_{Lap}(wi | wi-n+1, ..., wi-1) = (c(wi-n+1, ..., wi) + 1) / (c(wi-n+1, ..., wi-1) + |V|)
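A minimal Python sketch (not part of the original slides) of Laplace-smoothed bigram estimation; the function names train_counts and bigram_prob_laplace are invented for illustration:

```python
from collections import Counter

def train_counts(corpus):
    """Collect unigram and bigram counts from a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def bigram_prob_laplace(w_prev, w, unigrams, bigrams, vocab_size):
    """P_Lap(w | w_prev) = (c(w_prev, w) + 1) / (c(w_prev) + |V|)."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_counts(corpus)
V = len(uni)  # closed vocabulary, as assumed on the earlier slide
print(bigram_prob_laplace("the", "cat", uni, bi, V))  # seen bigram
print(bigram_prob_laplace("cat", "dog", uni, bi, V))  # unseen bigram still gets nonzero mass
```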

Page 6

Problem with Laplace smoothing

• Example: |V| = 100K, the bigram "w1 w2" occurs 10 times, and the trigram "w1 w2 w3" occurs 9 times.
  – P_{MLE}(w3 | w1, w2) = 9/10 = 0.9
  – P_{Lap}(w3 | w1, w2) = (9 + 1) / (10 + 100K) ≈ 0.0001

• Problem: it gives too much probability mass to unseen n-grams.

Add-one smoothing works horribly in practice.

Page 7

Variant: add a small constant δ (< 1) to each count instead of 1.

It works better than add-one, but still works horribly.

Need to choose δ.

Page 8

Good-Turing smoothing

Page 9

Basic ideas

• Re-estimate the frequency of zero-count n-grams using the number of n-grams that occur once.

• Let Nc be the number of n-grams that occurred c times.

• The Good-Turing estimate: for any n-gram that occurs c times, we pretend that it occurs c* times, where

  c* = (c + 1) · N_{c+1} / N_c

Page 10

N_0 is the number of unseen n-grams.

For unseen n-grams, we assume that each of them occurs c_0* times:

  c_0* = (0 + 1) · N_1 / N_0 = N_1 / N_0

The total prob mass for unseen n-grams is:

  N_0 · c_0* / N = N_1 / N    (N is the total number of n-gram tokens)

Therefore, the prob of EACH unseen n-gram is:

  N_1 / (N_0 · N)

Page 11

An example

Page 12

N-gram counts to conditional probability

c* comes from GT estimate.

Page 13

Issues in Good-Turing estimation

• If N_{c+1} is zero, how do we estimate c*?
  – Smooth N_c by some function. Ex: log(N_c) = a + b·log(c)
  – Large counts are assumed to be reliable (gt_max). Ex: c* = c for c > gt_max

• May also want to treat n-grams with low counts (especially 1) as zeroes (gt_min).

• Need to renormalize all the estimates to ensure that the probabilities add to one.

• Good-Turing is often not used by itself; it is used in combination with the backoff and interpolation algorithms.

Page 14

One way to implement Good-Turing

• Let N be the number of trigram tokens in the training corpus, and min3 and max3 be the min and max cutoffs for trigrams.

• From the trigram counts:
  – calculate N_0, N_1, ..., N_{max3+1}, and N
  – calculate a function f(c), for c = 0, 1, ..., max3

• Define c* = c if c > max3, and c* = f(c) otherwise.

• Do the same for bigram counts and unigram counts.

Page 15

Good-Turing implementation (cont)

• Estimate trigram conditional prob:

• For an unseen trigram, the joint prob is: N_1 / (N_0 · N), using the trigram N_0, N_1, and N.

Do the same for unigram and bigram models
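As a rough illustration of the counting steps on the last two slides, here is a Python sketch (my own, with invented helper names) that computes the count-of-counts N_c from trigram counts, applies the GT adjustment below a cutoff, and gives the joint probability of an unseen trigram. A real implementation would also smooth the N_c values (e.g., the log-linear fit from the Issues slide) before using them:

```python
from collections import Counter

def count_of_counts(ngram_counts, max_c):
    """N_c: number of distinct n-grams seen exactly c times, for c = 1 .. max_c + 1."""
    freq_of_freq = Counter(ngram_counts.values())
    return {c: freq_of_freq.get(c, 0) for c in range(1, max_c + 2)}

def gt_adjusted_count(c, nc, max_c):
    """For seen counts c >= 1: c* = c if c > max_c, else (c + 1) * N_{c+1} / N_c (assumes N_c > 0)."""
    if c > max_c:
        return c
    return (c + 1) * nc[c + 1] / nc[c]

def unseen_trigram_joint_prob(nc, vocab_size, trigram_counts, total_tokens):
    """Joint prob of each unseen trigram: N_1 / (N_0 * N), with N_0 = |V|^3 - #seen trigram types."""
    n0 = vocab_size ** 3 - len(trigram_counts)
    return nc[1] / (n0 * total_tokens)
```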

Page 16

Backoff and interpolation

Page 17

N-gram hierarchy

• P3(w3|w1,w2), P2(w3|w2), P1(w3)

• Back off to a lower-order N-gram: backoff estimation

• Mix the probability estimates from all the N-grams: interpolation

Page 18

Katz Backoff

Pkatz(wi | wi-2, wi-1) =
  P3(wi | wi-2, wi-1)                  if c(wi-2, wi-1, wi) > 0
  α(wi-2, wi-1) · Pkatz(wi | wi-1)     otherwise

Pkatz(wi | wi-1) =
  P2(wi | wi-1)        if c(wi-1, wi) > 0
  α(wi-1) · P1(wi)     otherwise

Page 19

Katz backoff (cont)

• The α's are used to normalize probability mass so that it still sums to 1, and to "smooth" the lower-order probabilities that are used.

• See J&M Sec 4.7.1 for details of how to calculate α (and M&S 6.3.2 for additional discussion).
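A minimal Python sketch of the backoff recursion from the previous slide, assuming the discounted probabilities P3, P2, P1 and the normalizing α weights have already been computed elsewhere (J&M 4.7.1 covers the α computation); all names here are illustrative:

```python
def p_katz_trigram(w1, w2, w3, p3, p2, p1, alpha2, alpha1):
    """P_katz(w3 | w1, w2): use the (discounted) trigram estimate if the trigram was seen,
    otherwise back off to the bigram estimate, scaled by alpha(w1, w2)."""
    if (w1, w2, w3) in p3:
        return p3[(w1, w2, w3)]
    return alpha2.get((w1, w2), 1.0) * p_katz_bigram(w2, w3, p2, p1, alpha1)

def p_katz_bigram(w2, w3, p2, p1, alpha1):
    """P_katz(w3 | w2): bigram estimate if seen, otherwise alpha(w2) * P1(w3)."""
    if (w2, w3) in p2:
        return p2[(w2, w3)]
    return alpha1.get(w2, 1.0) * p1.get(w3, 0.0)
```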

Page 20

Jelinek-Mercer smoothing (interpolation)

Bigram:

  P_interp(wi | wi-1) = λ · P_2(wi | wi-1) + (1 - λ) · P_1(wi)

Trigram:

  P_interp(wi | wi-2, wi-1) = λ_3 · P_3(wi | wi-2, wi-1) + λ_2 · P_2(wi | wi-1) + λ_1 · P_1(wi),
  where λ_1 + λ_2 + λ_3 = 1

Page 21

Interpolation (cont)

How to set the value for λi?

Page 22

How to set λi?

• Generally, here's what's done:
  – Split the data into training, held-out, and test sets
  – Train the model on the training set
  – Use the held-out set to try different λ values and pick the ones that work best (i.e., maximize the likelihood of the held-out data)
  – Test the model on the test data
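For concreteness, a toy Python sketch of this held-out procedure for a two-way bigram/unigram mixture, using a simple grid search over λ (λ can also be fit with EM); the probability tables p2 and p1 and the function names are assumptions for illustration:

```python
import math

def held_out_log_likelihood(held_out_bigrams, lam, p2, p1):
    """Sum of log(lam * P2(w | w_prev) + (1 - lam) * P1(w)) over held-out bigram tokens."""
    total = 0.0
    for w_prev, w in held_out_bigrams:
        p = lam * p2.get((w_prev, w), 0.0) + (1 - lam) * p1.get(w, 0.0)
        total += math.log(p) if p > 0 else float("-inf")
    return total

def pick_lambda(held_out_bigrams, p2, p1, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Return the lambda in the grid that maximizes held-out log-likelihood."""
    return max(grid, key=lambda lam: held_out_log_likelihood(held_out_bigrams, lam, p2, p1))
```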

Page 23

So far

• Laplace smoothing:

• Good-Turing Smoothing: gt_min, gt_max

• Linear interpolation: λi(wi-2, wi-1)

• Katz Backoff: α(wi-2, wi-1)

Page 24

Additional slides

Page 25

Another example for Good-Turing

• 10 tuna, 3 unagi, 2 salmon, 1 shrimp, 1 octopus, 1 yellowtail

• How likely is octopus? Since c(octopus) = 1, the GT estimate is 1*.

• To compute 1*, we need n1=3 and n2=1.

• What happens when Nc = 0?
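Working through the numbers: 1* = (1 + 1) · n2 / n1 = 2 · (1/3) ≈ 0.67, so with N = 18 tokens in total, P_GT(octopus) = 1*/N ≈ 0.037, compared with the MLE estimate 1/18 ≈ 0.056. And if N_{c+1} = 0, the formula gives c* = 0, which is one reason the N_c values need to be smoothed (see the earlier Issues slide).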

Page 26

Absolute discounting

What is the value for D?

How to set α(x)?
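For reference, the standard absolute-discounting form (the slide's exact variant may differ) subtracts a fixed discount D from every nonzero count:

  P_abs(wi | wi-1) = max(c(wi-1, wi) - D, 0) / c(wi-1) + α(wi-1) · P(wi)

Here α(wi-1) redistributes the discounted mass so the distribution sums to one, and D is typically set on held-out data or estimated as N_1 / (N_1 + 2·N_2).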

Page 27

Intuition for Kneser-Ney smoothing

• I cannot find my reading __

• P(Francisco | reading) > P(glasses | reading)
  – "Francisco" is common, so interpolation gives P(Francisco | reading) a high value.
  – But "Francisco" occurs in few contexts (only after "San"), whereas "glasses" occurs in many contexts.
  – Hence, weight the interpolation based on the number of contexts for the word, using discounting.

Words that have appeared in more contexts are more likely to appear in some new context as well.

Page 28

Kneser-Ney smoothing (cont)

Interpolation:

Backoff:
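The standard interpolated Kneser-Ney form (as in Chen & Goodman; the original slide may have shown a slightly different variant) replaces the lower-order probability with a continuation probability based on the number of distinct left contexts:

  P_KN(wi | wi-1) = max(c(wi-1, wi) - D, 0) / c(wi-1) + λ(wi-1) · P_cont(wi)

  P_cont(wi) = |{w' : c(w', wi) > 0}| / |{(w', w) : c(w', w) > 0}|

where λ(wi-1) is chosen so that the probabilities sum to one; the backoff version uses P_cont only when the bigram is unseen.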

Page 29

Class-based LM

• Examples:
  – The stock rises to $81.24
  – He will visit Hyderabad next January

• P(wi | wi-1) ≈ P(ci-1 | wi-1) · P(ci | ci-1) · P(wi | ci)

• Hard clustering vs. soft clustering
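A small Python sketch of a hard-clustering class bigram model following the factorization above; the cluster map and the probability tables are invented toy values (with hard clustering, P(ci-1 | wi-1) is 1 for the word's own class, so it drops out):

```python
def class_bigram_prob(w_prev, w, word2class, p_class_bigram, p_word_given_class):
    """P(w | w_prev) ~ P(c_prev | w_prev) * P(c | c_prev) * P(w | c).
    With hard clustering, P(c_prev | w_prev) = 1 for the word's own class."""
    c_prev, c = word2class[w_prev], word2class[w]
    return p_class_bigram.get((c_prev, c), 0.0) * p_word_given_class.get((w, c), 0.0)

# Toy example (all numbers are made up):
word2class = {"next": "OTHER", "January": "MONTH", "Hyderabad": "CITY"}
p_class_bigram = {("OTHER", "MONTH"): 0.2, ("OTHER", "CITY"): 0.1}
p_word_given_class = {("January", "MONTH"): 0.3, ("Hyderabad", "CITY"): 0.05}
print(class_bigram_prob("next", "January", word2class, p_class_bigram, p_word_given_class))  # 0.06
```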

Page 30

Summary

• Laplace smoothing (a.k.a. Add-one smoothing)

• Good-Turing Smoothing

• Linear interpolation: λ(wi-2, wi-1), λ(wi-1)

• Katz Backoff: α(wi-2, wi-1), α(wi-1)

• Absolute discounting: D, α(wi-2, wi-1), α(wi-1)

• Kneser-Ney smoothing: D, α(wi-2, wi-1), α(wi-1)

• Class-based smoothing: clusters