Page 1: Three Basic Problems

Three Basic Problems

1. Compute the probability of a text (observation)
   • language modeling – evaluate alternative texts and models
   • P_m(W_{1,N})

2. Compute the maximum probability tag (state) sequence
   • Tagging / classification
   • argmax_{T_{1,N}} P_m(T_{1,N} | W_{1,N})

3. Compute the maximum likelihood model
   • training / parameter estimation
   • argmax_m P_m(W_{1,N})

Page 2: Three Basic Problems

Compute Text Probability

• Recall: P(W,T) = Π_i P(t_{i-1} → t_i) P(w_i | t_i)

• Text probability: need to sum P(W,T) over all possible tag sequences – an exponential number

• Dynamic programming approach – similar to the Viterbi algorithm

• Will also be used for estimating model parameters from an untagged corpus

Page 3: Three Basic Problems

Forward Algorithm

Define: A_i(k) = P(w_{1,k}, t_k = t_i);  N_t – total number of tags

1. For i = 1 To N_t:  A_i(1) = m(t_0 → t_i) m(w_1 | t_i)
2. For k = 2 To N;  For j = 1 To N_t:
   A_j(k) = [Σ_i A_i(k-1) m(t_i → t_j)] m(w_k | t_j)
3. Then:  P_m(W_{1,N}) = Σ_i A_i(N)

Complexity = O(N_t² N)  (like Viterbi, with Σ instead of max)
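The recursion above translates directly into a short dynamic-programming routine. The sketch below is illustrative only (not from the original slides); the container names init, trans, and emit are assumptions standing in for m(t_0 → t_i), m(t_i → t_j), and m(w | t_i).

# Illustrative sketch of the Forward algorithm.
# Assumed containers: init[i] ~ m(t_0 -> t_i), trans[i][j] ~ m(t_i -> t_j),
# emit[i] is a dict mapping a word to m(w | t_i).
def forward(words, init, trans, emit):
    Nt, N = len(init), len(words)
    A = [[0.0] * Nt for _ in range(N)]           # A[k][i] holds A_i(k+1)
    for i in range(Nt):                          # A_i(1) = m(t_0 -> t_i) m(w_1 | t_i)
        A[0][i] = init[i] * emit[i].get(words[0], 0.0)
    for k in range(1, N):                        # A_j(k) = [sum_i A_i(k-1) m(t_i -> t_j)] m(w_k | t_j)
        for j in range(Nt):
            A[k][j] = sum(A[k - 1][i] * trans[i][j] for i in range(Nt)) \
                      * emit[j].get(words[k], 0.0)
    return sum(A[N - 1]), A                      # P_m(W_{1,N}) = sum_i A_i(N), plus the lattice

For long texts these products underflow, so real implementations scale the A values or work in log space.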

Page 4: Three Basic Problems

Forward Algorithm

[Trellis diagram: one column of states t1 … t5 per word w1, w2, w3. Each node holds a forward value A_i(k); the first column is initialized via m(t_0 → t_i), the arcs between columns carry the transition probabilities m(t_i → t_j), and summing the last column gives P_m(W_{1,3}).]

Page 5: Three Basic Problems

Backward Algorithm

Define: B_i(k) = P(w_{k+1,N} | t_k = t_i)

1. For i = 1 To N_t:  B_i(N) = 1
2. For k = N-1 To 1;  For j = 1 To N_t:
   B_j(k) = Σ_i m(t_j → t_i) m(w_{k+1} | t_i) B_i(k+1)
3. Then:  P_m(W_{1,N}) = Σ_i m(t_0 → t_i) m(w_1 | t_i) B_i(1)

Complexity = O(N_t² N)
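For symmetry, here is the same kind of illustrative sketch for the backward pass, reusing the assumed init/trans/emit containers from the forward sketch above.

# Illustrative sketch of the Backward algorithm, mirroring forward().
def backward(words, init, trans, emit):
    Nt, N = len(init), len(words)
    B = [[0.0] * Nt for _ in range(N)]           # B[k][i] holds B_i(k+1)
    for i in range(Nt):                          # B_i(N) = 1
        B[N - 1][i] = 1.0
    for k in range(N - 2, -1, -1):               # B_j(k) = sum_i m(t_j -> t_i) m(w_{k+1} | t_i) B_i(k+1)
        for j in range(Nt):
            B[k][j] = sum(trans[j][i] * emit[i].get(words[k + 1], 0.0) * B[k + 1][i]
                          for i in range(Nt))
    prob = sum(init[i] * emit[i].get(words[0], 0.0) * B[0][i] for i in range(Nt))
    return prob, B                               # P_m(W_{1,N}) computed from the backward side

Both routines return P_m(W_{1,N}); getting the same value from the two directions is a convenient sanity check.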

Page 6: Three Basic Problems

Backward Algorithm

[Trellis diagram: the same lattice of states t1 … t5 over the words w1, w2, w3, now holding the backward values B_i(k). The last column is initialized to 1, the arcs between columns carry the transition probabilities m(t_i → t_j), and combining m(t_0 → t_i) m(w_1 | t_i) with the first column gives P_m(W_{1,3}).]

Page 7: Three Basic Problems

Estimation from Untagged Corpus: EM – Expectation-Maximization

1. Start with some initial model
2. Compute the probability of (virtually) each state sequence given the current model
3. Use this probabilistic tagging to produce probabilistic counts for all parameters, and use these counts to estimate a revised model, which increases the likelihood of the observed output W in each iteration
4. Repeat until convergence

Note: No labeled training data is required. Initialize with lexicon constraints on the possible POS for each word (cf. "noisy counting" for PPs)

Page 8: Three Basic Problems

Notation

• a_ij = estimate of P(t_i → t_j)

• b_jk = estimate of P(w_k | t_j)

• A_i(k) = P(w_{1,k}, t_k = t_i)   (from the Forward algorithm)

• B_i(k) = P(w_{k+1,N} | t_k = t_i)   (from the Backward algorithm)

Page 9: Three Basic Problems

Estimating transition probabilities

Define p_k(i,j) as the probability of traversing the arc t_i → t_j at time k, given the observations:

p_k(i,j) = P(t_k = t_i, t_{k+1} = t_j | W) = P(t_k = t_i, t_{k+1} = t_j, W) / P(W)

$$p_k(i,j) = \frac{A_i(k)\, a_{ij}\, b_{j,w_{k+1}}\, B_j(k+1)}{\sum_{r=1}^{N_t} A_r(k)\, B_r(k)} = \frac{A_i(k)\, a_{ij}\, b_{j,w_{k+1}}\, B_j(k+1)}{\sum_{r=1}^{N_t} \sum_{s=1}^{N_t} A_r(k)\, a_{rs}\, b_{s,w_{k+1}}\, B_s(k+1)}$$
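Given the A and B lattices and the current parameters, these arc probabilities can be computed directly. The function name and the layout (a[i][j] for a_ij, b[j] mapping a word to b_{j,w}) are assumptions for illustration, not part of the slides.

# Illustrative sketch: p_k(i,j) = P(t_k = t_i, t_{k+1} = t_j | W) for all k, i, j.
# A, B come from the forward/backward sketches; a[i][j] ~ a_ij; b[j] maps a word to b_{j,w}.
def arc_probs(words, A, B, a, b):
    Nt, N = len(a), len(words)
    p = [[[0.0] * Nt for _ in range(Nt)] for _ in range(N - 1)]
    for k in range(N - 1):
        denom = sum(A[k][r] * B[k][r] for r in range(Nt))      # = P(W)
        for i in range(Nt):
            for j in range(Nt):
                p[k][i][j] = (A[k][i] * a[i][j]
                              * b[j].get(words[k + 1], 0.0)
                              * B[k + 1][j]) / denom
    return p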

Page 10: Three Basic Problems

Expected transitions

• Define g_i(k) = P(t_k = t_i | W); then:

  g_i(k) = Σ_{j=1}^{N_t} p_k(i,j)

• Now note that:
  – Expected number of transitions from tag i = Σ_{k=1}^{N} g_i(k)
  – Expected number of transitions from tag i to tag j = Σ_{k=1}^{N} p_k(i,j)
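Following the definitions on this slide, g_i(k) and the two expected counts are just sums over the p array from the previous sketch; a minimal sketch, assuming that array layout:

# g_i(k) = sum_j p_k(i,j); the expected counts sum g and p over positions.
def state_probs(p, Nt):
    return [[sum(p[k][i][j] for j in range(Nt)) for i in range(Nt)]
            for k in range(len(p))]                            # g[k][i] holds g_i(k)

def expected_counts(p, g, Nt):
    from_i = [sum(g[k][i] for k in range(len(g))) for i in range(Nt)]
    from_i_to_j = [[sum(p[k][i][j] for k in range(len(p))) for j in range(Nt)]
                   for i in range(Nt)]
    return from_i, from_i_to_j          # expected transitions from tag i, and from tag i to tag j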

Page 11: Three Basic Problems

Re-estimation of Maximum Likelihood Parameters

• a'_ij = (expected # of transitions from tag i to tag j) / (expected # of transitions from tag i)

$$a'_{ij} = \frac{\sum_{k=1}^{N} p_k(i,j)}{\sum_{k=1}^{N} g_i(k)}$$

• b'_ik = (expected # of observations of w_k for tag i) / (expected # of transitions from tag i)

$$b'_{ik} = \frac{\sum_{r:\, w_r = w_k} \sum_{j=1}^{N_t} p_r(i,j)}{\sum_{k=1}^{N} g_i(k)}$$
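The two ratios above can be computed directly from p and g; again a sketch under the assumed data layout of the earlier sketches (note that Σ_j p_r(i,j) = g_i(r), so the emission numerator sums g over the positions where the word occurs).

# Illustrative re-estimation step from p and g.
# words is the training text, vocab the list of distinct words.
def reestimate(words, p, g, Nt, vocab):
    denom = [sum(g[k][i] for k in range(len(g))) for i in range(Nt)]   # expected transitions from tag i
    a_new = [[sum(p[k][i][j] for k in range(len(p))) / denom[i]
              for j in range(Nt)] for i in range(Nt)]
    b_new = [{w: sum(g[k][i] for k in range(len(g)) if words[k] == w) / denom[i]
              for w in vocab} for i in range(Nt)]
    return a_new, b_new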

Page 12: Three Basic Problems

EM Algorithm

1. Choose an initial model = <a, b, g(1)>
2. Repeat until the results don't improve (much):
   1. Compute p_k based on the current model, using the Forward & Backward algorithms to compute A and B (Expectation of counts)
   2. Compute the new model <a', b', g'(1)> (Maximization of parameters)

Note: The output likelihood is guaranteed to increase in each iteration, but it might converge to a local maximum!
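Putting the pieces together, the loop on this slide looks roughly like the sketch below; it reuses the illustrative forward, backward, arc_probs, state_probs, and reestimate sketches from the earlier slides and checks convergence on the output likelihood.

# Illustrative EM loop tying the earlier sketches together.
def em_train(words, init, a, b, vocab, tol=1e-6, max_iter=100):
    prev = float("-inf")
    for _ in range(max_iter):
        likelihood, A = forward(words, init, a, b)     # Expectation: compute A, B, p, g
        _, B = backward(words, init, a, b)
        p = arc_probs(words, A, B, a, b)
        g = state_probs(p, len(a))
        a, b = reestimate(words, p, g, len(a), vocab)  # Maximization: new a', b'
        if likelihood - prev < tol:                    # stop once the improvement is tiny
            break
        prev = likelihood
    # Note: the initial distribution g(1) could be re-estimated from g[0] as well;
    # it is kept fixed here for brevity.
    return a, b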

Page 13: Three Basic Problems

Initialize Model by Dictionary Constraints

• Training should be directed so that it matches the linguistic notion of POS (recall that EM may converge to a local maximum)

• Achieved by a dictionary with possible POS for each word

• Word-based initialization (sketched below):
  – P(w|t) = 1 / (# of listed POS for w) for each listed POS, and 0 for unlisted POS
• Class-based initialization (Kupiec, 1992):
  – Group all words with the same set of possible POS into a 'metaword'
  – Estimate parameters and perform tagging over metawords
  – Frequent words are handled individually
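A minimal sketch of the word-based initialization, assuming a dictionary (lexicon) that maps each word to the set of POS tags listed for it:

# Illustrative word-based initialization of the emission parameters.
# lexicon maps a word to the set of POS tags listed for it in the dictionary.
def init_emissions(lexicon, tags):
    b = [dict() for _ in tags]                     # b[i][w] ~ initial P(w | t_i)
    for w, possible in lexicon.items():
        for i, t in enumerate(tags):
            if t in possible:
                b[i][w] = 1.0 / len(possible)      # uniform over the listed POS
            # unlisted POS simply get no entry, i.e. probability 0
    return b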

Page 14: Three Basic Problems

Some extensions for HMM POS tagging

• Higher-order models: trigrams, possibly interpolated with bigrams

• Incorporating text features:
  – Output probability = P(w_i, f_j | t_k), where f is a vector of features (capitalized, ends in -d, etc.)
  – Features are useful for handling unknown words

• Combining labeled and unlabeled training (initialize with labeled then do EM)

Page 15: Three Basic Problems

Transformational Based Learning (TBL) for Tagging

• Introduced by Brill (1995)
• Can exploit a wider range of lexical and syntactic regularities via transformation rules – a triggering environment and a rewrite rule
• Tagger (sketched below):
  – Construct an initial tag sequence for the input – the most frequent tag for each word
  – Iteratively refine the tag sequence by applying "transformation rules" in rank order
• Learner:
  – Construct an initial tag sequence for the training corpus
  – Loop until done:
    • Try all possible rules and compare to the known tags; apply the best rule r* to the sequence and add it to the rule ranking
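A minimal sketch of the tagging side: each rule is a (from-tag, to-tag, trigger) triple, where the trigger tests the triggering environment, and the rules are applied in rank order. This tuple encoding is an assumption for illustration, not Brill's data format.

# Illustrative TBL tagger: rules applied in rank order.
# A trigger is a function (tags, words, i) -> bool encoding the triggering environment.
def tbl_tag(words, most_frequent_tag, ranked_rules):
    tags = [most_frequent_tag.get(w, "NN") for w in words]   # initial tagging
    for from_tag, to_tag, trigger in ranked_rules:           # apply each rule over the sequence
        for i in range(len(tags)):
            if tags[i] == from_tag and trigger(tags, words, i):
                tags[i] = to_tag
    return tags

# Example (rule 1 of the next slide): change NN to VB if the previous tag is TO.
rule1 = ("NN", "VB", lambda tags, words, i: i > 0 and tags[i - 1] == "TO")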

Page 16: Three Basic Problems

Some examples

1. Change NN to VB if the previous tag is TO
   – to/TO conflict/NN with → VB
2. Change VBP to VB if MD is among the previous three tags
   – might/MD vanish/VBP → VB
3. Change NN to VB if MD is among the previous two tags
   – might/MD reply/NN → VB
4. Change VB to NN if DT is among the previous two tags
   – the/DT reply/VB → NN

Page 17: Three Basic Problems

Transformation Templates

• Specify which transformations are possible

For example, change tag A to tag B when:

1. The preceding (following) tag is Z

2. The tag two before (after) is Z

3. One of the two previous (following) tags is Z

4. One of the three previous (following) tags is Z

5. The preceding tag is Z and the following is W

6. The preceding (following) tag is Z and the tag two before (after) is W

Page 18: Three Basic Problems

Lexicalization

New templates to include dependency on surrounding words (not just tags).

Change tag A to tag B when:
1. The preceding (following) word is w
2. The word two before (after) is w
3. One of the two preceding (following) words is w
4. The current word is w
5. The current word is w and the preceding (following) word is v
6. The current word is w and the preceding (following) tag is X (notice: a word–tag combination)
7. etc.

Page 19: Three Basic Problems

Initializing Unseen Words

• How to choose the most likely tag for unseen words?
• Transformation-based approach:
  – Start with NP for capitalized words, NN for others
  – Learn "morphological" transformations from templates of the form "change tag from X to Y if":
    1. Deleting the prefix (suffix) x results in a known word
    2. The first (last) characters of the word are x
    3. Adding x as a prefix (suffix) results in a known word
    4. Word W ever appears immediately before (after) the word
    5. Character Z appears in the word

Page 20: Three Basic Problems

TBL Learning Scheme

[Diagram: the Unannotated Input Text passes through the Setting-Initial-State step to produce an Annotated Text; the Annotated Text is compared against the Ground Truth for the Input Text by the Learning Algorithm, which outputs the Rules.]

Page 21: Three Basic Problems

Greedy Learning Algorithm

• Initial tagging of the training corpus – most frequent tag per word
• At each iteration:
  – Compute the "error reduction" for each transformation rule (see the sketch below):
    • # errors fixed - # errors introduced
  – Find the best rule; if its error reduction is greater than a threshold (to avoid overfitting):
    • Apply the best rule to the training corpus
    • Append the best rule to the ordered list of transformations
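The scoring step of the learner can be sketched as follows; the rule encoding matches the tagger sketch above, candidate generation from the templates is assumed to happen elsewhere, and the threshold value is an arbitrary placeholder.

# Illustrative scoring of a candidate rule: # errors fixed - # errors introduced.
def error_reduction(rule, tags, words, truth):
    from_tag, to_tag, trigger = rule
    fixed = introduced = 0
    for i in range(len(tags)):
        if tags[i] == from_tag and trigger(tags, words, i):
            if truth[i] == to_tag:
                fixed += 1                         # a wrong tag becomes correct
            elif truth[i] == from_tag:
                introduced += 1                    # a correct tag becomes wrong
    return fixed - introduced

# Greedy step: keep the best-scoring rule only if it clears the threshold.
def best_rule(candidates, tags, words, truth, threshold=2):
    scored = [(error_reduction(r, tags, words, truth), r) for r in candidates]
    score, rule = max(scored, key=lambda x: x[0])
    return rule if score > threshold else None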

Page 22: Three Basic Problems

Morphological Richness

• Parts of speech really include features:
  – NN2 → Noun(type=common, num=plural)
• This is more visible in other languages with richer morphology:
  – Hebrew nouns: number, gender, possession
  – German nouns: number, gender, case, …
  – And so on…