
Pitman-Yor Process in statistical language models


Page 1: Pitman-Yor Process in statistical language models

Pitman-Yor Process in statistical language models

J Li

September 28, 2011


Page 2: Pitman-Yor Process in statistical language models

Pitman-Yor Process

General stick-breaking prior: $P(\cdot) = \sum_{k=1}^{N} p_k\,\delta_{Z_k}(\cdot)$

Generating values: $Z_k \sim H$

Assigning weights: $p_k = V_k \prod_{i=1}^{k-1}(1 - V_i)$, with $V_k \sim \mathrm{Beta}(a_k, b_k)$

Pitman-Yor as a special case: $\mathrm{PY}(a, b, H)$, $a \in [0, 1)$, $b > -a$, with $V_k \sim \mathrm{Beta}(1 - a,\, b + ka)$

The weights $\{p_k\}_{k=1}^{N}$ induce a power-law distribution: $P(n_w) \propto n_w^{-(1+a)}$

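A minimal sketch of the truncated stick-breaking construction above (illustrative code, not from the slides):

```python
# Truncated stick-breaking for PY(a, b, H) weights: V_k ~ Beta(1 - a, b + k*a).
import numpy as np

def py_stick_breaking(a, b, N, rng=None):
    """Return weights p_1..p_N; valid for a in [0, 1), b > -a."""
    rng = rng or np.random.default_rng()
    remaining = 1.0                       # length of stick still unbroken
    p = np.empty(N)
    for k in range(1, N + 1):
        v = rng.beta(1 - a, b + k * a)    # proportion broken off this round
        p[k - 1] = remaining * v
        remaining *= 1.0 - v
    return p

weights = py_stick_breaking(a=0.5, b=1.0, N=1000)
```

Pairing each $p_k$ with an independent draw $Z_k \sim H$ gives a (truncated) sample of the random measure $P$.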


Page 5: Pitman-Yor Process in statistical language models

Why power-law?

Think of the proportions broken off the remaining stick, $V_k \sim \mathrm{Beta}(1 - a,\, b + ka)$:

$$E[V_k] = \frac{1 - a}{1 + b + (k - 1)a}$$

$a = 0$ reduces to the DP

Suppose we set $a = 0.5$, $b = 0$; the pdfs of successive $V_k$ are shown in the figure (not reproduced in this transcript)

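With the slide's numbers, $E[V_k] = \frac{0.5}{1 + 0.5(k-1)} = \frac{1}{k+1}$, i.e. $\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{4}, \dots$: the expected break proportions decay only harmonically, which is what leaves enough mass in the tail for a power law. Since the figure did not survive extraction, here is a minimal sketch to reproduce it (plotting choices are illustrative):

```python
# Pdfs of successive V_k ~ Beta(1 - a, b + k*a) with a = 0.5, b = 0.
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

a, b = 0.5, 0.0
x = np.linspace(0.01, 0.99, 400)
for k in range(1, 5):
    plt.plot(x, beta.pdf(x, 1 - a, b + k * a), label=f"V_{k}")
plt.xlabel("V_k")
plt.ylabel("pdf")
plt.legend()
plt.show()
```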


Page 7: Pitman-Yor Process in statistical language models

Two stage language model

Using the CRP analogy:

The "generator" labels tables with word types: $l_k \sim \pi(l \mid \theta)$

The "adaptor" assigns customers to tables: $z_k \sim \mathrm{PY}(a, b)$

The outcome is a stream of word tokens: $w_1 = l_{z_1}, w_2 = l_{z_2}, \dots$

Note: this is not necessarily a true Pitman-Yor process.

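A minimal sketch of the two-stage sampler (names are illustrative, not the authors' code): the adaptor seats each token via the Pitman-Yor Chinese restaurant process, and the generator labels each newly opened table.

```python
# Two-stage model: adaptor = PY(a, b) CRP over tables, generator = pi(l|theta).
import random

def two_stage_sample(n_tokens, a, b, generator):
    """generator: zero-argument function drawing a word type l_k ~ pi(l|theta)."""
    counts, labels, tokens = [], [], []
    for n in range(n_tokens):
        # Existing table k has weight (n_k - a); a new one has weight (b + a*K),
        # so drawing r uniformly on [0, n + b) implements the PY seating rule.
        r = random.uniform(0, n + b)
        k, acc = None, 0.0
        for j, c in enumerate(counts):
            acc += c - a
            if r < acc:
                k = j
                break
        if k is None:                     # open a new table and label it
            counts.append(0)
            labels.append(generator())
            k = len(counts) - 1
        counts[k] += 1
        tokens.append(labels[k])
    return tokens

words = two_stage_sample(50, a=0.5, b=0.0,
                         generator=lambda: random.choice("abcde"))
```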


Page 10: Pitman-Yor Process in statistical language models

Prediction rule (Pólya urn)

Compare with the DP prediction rule

The authors set $b = 0$. Why?

Maybe because its value makes no difference.

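The equation here was an image; assuming it showed the standard PY$(a, b, H)$ predictive rule, it reads:

$$P(z_{N+1} = k \mid \mathbf{z}) = \frac{n_k - a}{N + b} \ \ (\text{existing table } k), \qquad P(z_{N+1} = \text{new} \mid \mathbf{z}) = \frac{b + aK}{N + b}$$

With $b = 0$ these become $(n_k - a)/N$ and $aK/N$, so the discount $a$ alone controls both the per-table damping and the rate of new tables; compare the DP rule, $n_k/(N + b)$ and $b/(N + b)$, which has no discount.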


Page 13: Pitman-Yor Process in statistical language models

Estimate based on tokens or types?

Simplified setting: given an observation of a set of $N$ words, derive a distribution over all words.

Based on tokens: $\hat{\pi}_{w,1} = n_w / N$

Based on types: $\hat{\pi}_{w,2} \propto I(w \in \mathcal{W})$

Interpolate between them: $\hat{\pi}_w = \alpha(n_w)\,\hat{\pi}_{w,1} + \beta\,\hat{\pi}_{w,2}$

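A toy version of the three estimators (the corpus and interpolation weights are illustrative; in the actual model $\alpha$ may depend on $n_w$):

```python
# Token-based vs type-based estimates on a toy corpus, plus interpolation.
from collections import Counter

corpus = ["the", "cat", "sat", "on", "the", "mat", "the"]
n = Counter(corpus)                       # token counts n_w
N = len(corpus)                           # total tokens
W = set(corpus)                           # observed types

pi_tokens = {w: n[w] / N for w in W}      # pi_hat_{w,1}
pi_types = {w: 1 / len(W) for w in W}     # pi_hat_{w,2}, normalized indicator

alpha, beta_ = 0.7, 0.3                   # illustrative constants
pi_hat = {w: alpha * pi_tokens[w] + beta_ * pi_types[w] for w in W}
```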


Page 15: Pitman-Yor Process in statistical language models

Interpolated Kneser-Ney (IKN)

Task: estimate the distribution $P(w_{N+1} \mid w_{N-n+2 \cdots N})$

Given: a vector of $N$ words $\mathbf{w}$ that share a common history (the $n - 1$ previous words), and vectors of words with different histories $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(H)}$

IKN estimator:

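The estimator itself was an image on the slide; in generic notation (not necessarily the paper's), interpolated Kneser-Ney with absolute discount $D$ is:

$$P_{\mathrm{IKN}}(w_{N+1} = w \mid \mathbf{w}) = \frac{\max(n_w - D,\, 0)}{N} + \frac{D\,K}{N}\, P_{\mathrm{IKN}}(w \mid \text{shorter history})$$

where $K$ is the number of distinct word types observed after this history; the discounted mass $DK/N$ is redistributed through the lower-order model.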


Page 17: Pitman-Yor Process in statistical language models

Two stage language model revisited: adjust parameters

Likelihood:

$a \in [0, 1)$

When $a \to 1^-$, $\Gamma(1 - a) \to \infty$, so $\dfrac{\Gamma(n_k^{(z)} - a)}{\Gamma(1 - a)} \to 0$ unless $n_k^{(z)} = 1$

When $a \to 0^+$, $K(z)$ tends to be small, due to the $a^{K(z)}$ term

In fact, $K(z) \approx$ the number of distinct words in $\mathbf{w}$

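The likelihood on the slide was an image; for a seating arrangement $\mathbf{z}$ with $K(\mathbf{z})$ tables of sizes $n_k^{(z)}$ under PY$(a, b)$, the standard form is:

$$P(\mathbf{z} \mid a, b) = \frac{\prod_{k=1}^{K(\mathbf{z})-1}(b + ka)}{\prod_{i=1}^{N-1}(b + i)}\ \prod_{k=1}^{K(\mathbf{z})} \frac{\Gamma(n_k^{(z)} - a)}{\Gamma(1 - a)}$$

With $b = 0$ the first product is $a^{K(\mathbf{z})-1}(K(\mathbf{z})-1)!$, which is the $a^{K(z)}$-type term behind the second limit above.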


Page 20: Pitman-Yor Process in statistical language models

Correspondence

Two-stage model:

IKN model:

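The two equations here were images; a standard way to state the correspondence (which the slide presumably showed) is the following. With $b = 0$, $t_w$ tables labelled $w$, and $T = \sum_w t_w$, the two-stage predictive is

$$\frac{n_w - a\,t_w}{N} + \frac{a\,T}{N}\,\pi(w),$$

while IKN gives $\frac{\max(n_w - D, 0)}{N} + \frac{DK}{N}\,P_{\text{lower}}(w)$. The two match when every word type occupies exactly one table ($t_w = 1$, $T = K$) and $D = a$, with the generator playing the role of IKN's lower-order distribution.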

Page 21: Pitman-Yor Process in statistical language models

Application: morphology

Generator:

inflection class: $c_k \sim \mathrm{Mult}(\kappa)$
stem: $t_k \mid c_k \sim \mathrm{Mult}(\tau)$
suffix: $f_k \mid c_k \sim \mathrm{Mult}(\phi)$
$l_k = t_k \cdot f_k$

Adaptor: $z_k \sim \mathrm{PY}(a, 0)$

Output: $w_k = l_{z_k}$

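A sketch of this generator (the multinomial parameters are illustrative, and within-class choices are uniform for brevity); it plugs directly into the two_stage_sample sketch above as the generator argument:

```python
# Morphology generator: class -> (stem, suffix) -> label l_k = stem + suffix.
import random

kappa = [0.5, 0.5]                            # P(inflection class), Mult(kappa)
tau = [["walk", "jump"], ["cat", "dog"]]      # stems per class
phi = [["", "ed", "ing"], ["", "s"]]          # suffixes per class

def morphology_generator():
    c = random.choices(range(len(kappa)), weights=kappa)[0]  # c_k ~ Mult(kappa)
    t = random.choice(tau[c])                                # t_k | c_k
    f = random.choice(phi[c])                                # f_k | c_k
    return t + f                                             # l_k = t_k . f_k

words = two_stage_sample(100, a=0.5, b=0.0, generator=morphology_generator)
```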

Page 22: Pitman-Yor Process in statistical language models

Gibbs Sampling

Update $\theta = (c, t, f)$:

Update $z$:

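The update equations were images; below is a sketch of the seating update for $z_i$ under the PY$(a, 0)$ adaptor, following the generic CRP-style resampling (the bookkeeping layout is illustrative, and it assumes $a \in (0, 1)$):

```python
# One Gibbs update of token i's seat: existing tables labelled with the word
# get weight (n_k - a); a new table gets weight a * K * pi_w.
import random

def resample_seat(i, word, seats, tables, a, pi_w):
    """tables: dict id -> [count, label]; pi_w: generator prob of `word`."""
    k_old = seats[i]
    tables[k_old][0] -= 1                 # remove customer i
    if tables[k_old][0] == 0:
        del tables[k_old]                 # drop emptied table
    cands = [k for k, (c, l) in tables.items() if l == word]
    weights = [tables[k][0] - a for k in cands]
    new_id = max(tables, default=-1) + 1
    cands.append(new_id)
    # If no tables remain, the token must open a new one.
    weights.append(a * len(tables) * pi_w if tables else 1.0)
    k_new = random.choices(cands, weights=weights)[0]
    if k_new == new_id:
        tables[new_id] = [0, word]
    tables[k_new][0] += 1
    seats[i] = k_new
```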

Page 23: Pitman-Yor Process in statistical language models

Evaluation

$a = 0$ works the best; therefore estimation by type is justified

Wait a minute, doesn't $a = 0$ correspond to the DP?

The author claims the value of the model lies in its flexibility



Page 26: Pitman-Yor Process in statistical language models

Discussion

Questions?
