Hidden–Variable Models for Discriminative Reranking
Terry Koo and Michael Collins
{maestro|mcollins}@csail.mit.edu
Overview of reranking
The reranking approach:
Use a baseline model to get the N-best candidates
“Rerank” the candidates using a more complex model
Parse reranking:
Collins (2000): 88.2% ⇒ 89.8%
Charniak and Johnson (2005): 89.7% ⇒ 91.0%
Talk by Brooke Cowan in 7B: 83.6% ⇒ 85.1%
Also applied to:
MT (Och and Ney, 2002; Shen et al., 2004)
NL Generation (Walker et al., 2001)
Representing NLP structures
Proper representation is critical to success
Hand–crafted feature vector representations:
Φ(tree) = {0, 1, 2, 0, 0, 3, 0, 1}
Features defined through kernels:
K(tree1, tree2) = Φ(tree1) · Φ(tree2)
This talk: a new approach using hidden variables
Two facets of lexical items
Different lexical items can have similar meanings, e.g. president and chairman
Clustering: president, chairman ∈ Noun Cluster 4
A single lexical item can have different meanings, e.g. [river] bank vs. [financial] bank
Refinement: bank1, bank2 ∈ bank
Model clusterings and refinements as hidden variables that support the reranking task
Highlights of the approach
Conditional log–linear model with hidden variables
Dynamic programming is used for training and decoding
Clustering and refinement done automatically using a discriminative criterion
Overview of talk
Motivation
Design
General form of the model
Training and decoding efficiently
Creating specific instantiations
Results
Discussion
Conclusion
The parse reranking framework
Sentences s_i for 1 ≤ i ≤ n
s1: Pierre Vinken , 61 years old , will join ...
s2: Mr. Vinken is chairman of Elsevier N.V. ...
s3: Big Board Chairman John Phelan said yesterday ...
Each s_i has candidate parses t_{i,j} for 1 ≤ j ≤ n_i
t_{i,1} is the best candidate parse for s_i
The parse reranking framework
t_{i,j} has phrase structure and dependency tree

[Figure: parse tree for “Mr. Vinken is chairman of Elsevier N.V.” with POS tags NNP NNP VB NN IN NNP NNP and nonterminals NP, PP, VP, S, plus the corresponding head–modifier dependencies]
Adding hidden variables
Hidden–value domains H_w(t_{i,j}) for 1 ≤ w ≤ len(s_i)

[Figure: the parse tree with a candidate hidden-value domain attached to each word, e.g. chairman → {NN1, NN2, NN3}, is → {VB1, VB2, VB3}, of → {IN1, IN2, IN3}]
Adding hidden variables
Assignment h ∈ H_1(t_{i,j}) × … × H_{len(s_i)}(t_{i,j})

[Figure: the parse tree with one hidden value selected from each word's domain]
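To make the assignment space concrete, here is a minimal Python sketch (illustrative only, not the authors' code) that enumerates H_1 × … × H_{len(s_i)} for the running example; even this short sentence already has 3^7 = 2187 assignments.

from itertools import product

# Hypothetical per-word hidden-value domains for
# "Mr. Vinken is chairman of Elsevier N.V.", three values each,
# mirroring the subdivided POS tags in the figure.
domains = {
    "Mr.":      ["NNP1", "NNP2", "NNP3"],
    "Vinken":   ["NNP1", "NNP2", "NNP3"],
    "is":       ["VB1", "VB2", "VB3"],
    "chairman": ["NN1", "NN2", "NN3"],
    "of":       ["IN1", "IN2", "IN3"],
    "Elsevier": ["NNP1", "NNP2", "NNP3"],
    "N.V.":     ["NNP1", "NNP2", "NNP3"],
}

words = list(domains)
assignments = product(*(domains[w] for w in words))
h = dict(zip(words, next(assignments)))   # one assignment h
print(h)
print("total assignments:", 3 ** len(words))  # 2187 for 7 words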
Marginalized probability model
Φ(t_{i,j}, h) produces a descriptive vector of feature occurrence counts, e.g.
Φ_2(t_{i,j}, h) = Count(chairman has hidden value NN1)
Φ_13(t_{i,j}, h) = Count(NNP2 is a direct object of VB1)
Φ_19(t_{i,j}, h) = Count(NN1 coordinates with NN2)
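One natural representation for such a count vector is a sparse map from feature names to counts; the sketch below (feature names and weights are made up, not the paper's actual feature set) scores one (t_{i,j}, h) pair as Φ · Θ.

from collections import Counter

# Sparse feature-occurrence counts for one (tree, assignment) pair;
# the names paraphrase the slide's Phi_2, Phi_13, Phi_19.
phi = Counter({
    "Word=chairman,Hidden=NN1": 1,
    "DirObj(VB1 -> NNP2)": 1,
    "Coord(NN1, NN2)": 2,
})

# Parameter vector Theta, also sparse (weights are illustrative).
theta = {"Word=chairman,Hidden=NN1": 0.7, "DirObj(VB1 -> NNP2)": -0.2}

score = sum(theta.get(f, 0.0) * count for f, count in phi.items())
print(score)  # Phi(t, h) . Theta = 0.5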
Marginalized probability model
Log–linear distribution over (t_{i,j}, h) with parameters Θ:

p(t_{i,j}, h \mid s_i, \Theta) = \frac{e^{\Phi(t_{i,j}, h) \cdot \Theta}}{\sum_{j', h'} e^{\Phi(t_{i,j'}, h') \cdot \Theta}}

Marginalize over assignments h:

p(t_{i,j} \mid s_i, \Theta) = \sum_{h} p(t_{i,j}, h \mid s_i, \Theta)
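A brute-force rendering of these two equations, as a sketch: the hypothetical `features(t, h)` callback stands in for the paper's feature extraction and is assumed to return sparse count dicts.

import math
from itertools import product

def dot(theta, phi):
    """Inner product Phi(t, h) . Theta over sparse count dicts."""
    return sum(theta.get(f, 0.0) * c for f, c in phi.items())

def rerank_probs(theta, candidates, domains, features):
    """p(t_j | s, Theta) by explicit marginalization over h.

    Exponential in sentence length -- exact but only viable for toy
    inputs; the paper's BP machinery computes the same sums efficiently.
    """
    weight = {}
    for j, t in enumerate(candidates):
        for h in product(*domains):
            weight[(j, h)] = math.exp(dot(theta, features(t, h)))
    z = sum(weight.values())           # normalizer over all (j, h)
    probs = [0.0] * len(candidates)
    for (j, h), w in weight.items():
        probs[j] += w / z              # sum_h p(t_j, h | s, Theta)
    return probs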
Optimizing the parameters
Define loss as negative log-likelihood
L(\Theta) = -\sum_{i=1}^{n} \log p(t_{i,1} \mid s_i, \Theta)
Minimize L(Θ) through gradient descent
\frac{\partial L}{\partial \Theta} = -\sum_{i} \sum_{h} p(h \mid t_{i,1}, s_i, \Theta)\, \Phi(t_{i,1}, h) + \sum_{i,j} p(t_{i,j} \mid s_i, \Theta) \sum_{h} p(h \mid t_{i,j}, s_i, \Theta)\, \Phi(t_{i,j}, h)
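The gradient is a difference of two feature expectations: Φ averaged under the joint p(t_{i,j}, h | s_i, Θ), minus Φ averaged under p(h | t_{i,1}, s_i, Θ) for the target parse. A sketch by enumeration, with the same hypothetical `features`/`domains` conventions as above and candidates[0] playing the role of t_{i,1}:

import math
from itertools import product
from collections import Counter

def dot(theta, phi):
    return sum(theta.get(f, 0.0) * c for f, c in phi.items())

def grad_one_sentence(theta, candidates, domains, features):
    """dL/dTheta for one sentence: E[Phi | s] - E[Phi | t_1, s]."""
    weight = {(j, h): math.exp(dot(theta, features(t, h)))
              for j, t in enumerate(candidates)
              for h in product(*domains)}
    z_all = sum(weight.values())
    z_tgt = sum(w for (j, _), w in weight.items() if j == 0)

    grad = Counter()
    for (j, h), w in weight.items():
        for f, c in features(candidates[j], h).items():
            grad[f] += (w / z_all) * c      # + expectation under p(t,h|s)
            if j == 0:                      # candidates[0] is t_{i,1}
                grad[f] -= (w / z_tgt) * c  # - expectation under p(h|t_1,s)
    return grad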
Overview of talk
Motivation
Design
General form of the model
Training and decoding efficiently
Creating specific instantiations
Results
Discussion
Conclusion
Problems with efficiency
|H_1(t_{i,j}) × … × H_{len(s_i)}(t_{i,j})| grows exponentially, so training the model is intractable:

\frac{\partial L}{\partial \Theta} = -\sum_{i} \sum_{h} p(h \mid t_{i,1}, s_i, \Theta)\, \Phi(t_{i,1}, h) + \sum_{i,j} p(t_{i,j} \mid s_i, \Theta) \sum_{h} p(h \mid t_{i,j}, s_i, \Theta)\, \Phi(t_{i,j}, h)

Decoding the model is also intractable:

p(t_{i,j} \mid s_i, \Theta) = \sum_{h} p(t_{i,j}, h \mid s_i, \Theta)
Locality constraint on features
Features have pairwise local scope on hidden variables
Features still have global scope on non-hidden information
Φ can be factored into local feature vectors, allowing dynamic programming
Local feature vectors
Define two kinds of local feature vector φ:
Single-variable φ(t_{i,j}, w, h_w) looks at a single hidden variable
Pairwise φ(t_{i,j}, u, v, h_u, h_v) looks at two hidden variables in a dependency relationship
Local feature vectors
Φ(t_{i,j}, h) looks at every hidden variable

[Figure: the parse tree with every word's hidden value highlighted]
Local feature vectors
φ(t_{i,j}, chairman, NN3) only sees NN3

[Figure: the parse tree with only chairman's hidden value NN3 highlighted]
Local feature vectors
φ(t_{i,j}, chairman, of, NN3, IN2) sees NN3 and IN2

[Figure: the parse tree with the hidden values NN3 (chairman) and IN2 (of) highlighted]
Local feature vectors
Rewrite global Φ as a sum over local φ
\Phi(t_{i,j}, h) = \sum_{w \in t_{i,j}} \phi(t_{i,j}, w, h_w) + \sum_{(u,v) \in D(t_{i,j})} \phi(t_{i,j}, u, v, h_u, h_v)
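In code, the factorization says the global vector is just an accumulation of local ones. A sketch (the local extractors `phi_word` and `phi_dep` are hypothetical placeholders for the paper's feature templates):

from collections import Counter

def global_phi(tree, h, positions, deps, phi_word, phi_dep):
    """Phi(t, h) as a sum of local feature vectors.

    positions : hidden-variable positions w in the tree
    deps      : head-modifier pairs (u, v) in D(t)
    phi_word  : phi_word(tree, w, h[w])          -> count dict
    phi_dep   : phi_dep(tree, u, v, h[u], h[v])  -> count dict
    """
    phi = Counter()
    for w in positions:
        phi.update(phi_word(tree, w, h[w]))          # single-variable terms
    for u, v in deps:
        phi.update(phi_dep(tree, u, v, h[u], h[v]))  # pairwise terms
    return phi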
Applying belief propagation
New restrictions enable dynamic–programming approaches, e.g. belief propagation
BP generalizes the forward–backward algorithm from a chain to a tree
Runtime O(len(s_i) · H²), where H = max |H_w(t_{i,j})|
BP efficiently computes

\sum_{h} p(t_{i,j}, h \mid s_i, \Theta) \quad\text{and}\quad \sum_{h} p(h \mid t_{i,j}, s_i, \Theta)\, \Phi(t_{i,j}, h)
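For intuition, here is a generic sum-product sketch on a tree (not the authors' implementation). With unary potentials ψ_w(h_w) = e^{φ(t,w,h_w)·Θ} on words and pairwise potentials ψ_uv(h_u,h_v) = e^{φ(t,u,v,h_u,h_v)·Θ} on dependency edges, the shared normalizer of the beliefs equals Σ_h e^{Φ(t,h)·Θ}, and the normalized beliefs give the per-variable posteriors needed for the expected feature counts.

from collections import defaultdict
from math import prod

def sum_product(nodes, edges, unary, pairwise):
    """Exact marginals on a tree via belief propagation (sketch).

    nodes    : hidden-variable ids, e.g. word positions
    edges    : (u, v) tree edges, e.g. dependency links
    unary    : unary[w][a] = psi_w(a)
    pairwise : pairwise[(u, v)][a][b] = psi_uv(a, b)
    Returns unnormalized beliefs b[w][a]; each belief sums to the
    partition function Z, and b[w][a] / Z = p(h_w = a).
    Runtime O(len(s) * H^2), matching the slide.
    """
    nbrs = defaultdict(list)
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)

    def psi(u, v, a, b):  # edge potentials are stored once per edge
        return pairwise[(u, v)][a][b] if (u, v) in pairwise \
            else pairwise[(v, u)][b][a]

    # Root the tree and record a parent-before-child visit order.
    root = nodes[0]
    parent, order, stack = {root: None}, [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in nbrs[u]:
            if v not in parent:
                parent[v] = u
                stack.append(v)

    msg = {}  # msg[(u, v)][b] = message from u to v

    def send(u, v):
        msg[(u, v)] = {
            b: sum(unary[u][a] * psi(u, v, a, b) *
                   prod(msg[(w, u)][a] for w in nbrs[u] if w != v)
                   for a in unary[u])
            for b in unary[v]
        }

    for u in reversed(order):        # upward pass: leaves to root
        if parent[u] is not None:
            send(u, parent[u])
    for u in order:                  # downward pass: root to leaves
        for v in nbrs[u]:
            if v != parent[u]:
                send(u, v)

    return {u: {a: unary[u][a] * prod(msg[(w, u)][a] for w in nbrs[u])
                for a in unary[u]}
            for u in nodes}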
Overview of talk
Motivation
Design
General form of the model
Training and decoding efficiently
Creating specific instantiations
Results
Discussion
Conclusion
Two areas for choice in the model
Definition of the hidden–value domains H_w(t_{i,j})
Definition of the feature vectors φ
Hidden–value domains
Lexical domains allow word refinement
[Figure: each word of “Mr. Vinken is chairman of Elsevier N.V.” is given a domain of refined copies of itself, e.g. chairman → {chairman1, chairman2, chairman3}, of → {of1, of2, of3}]
Hidden–value domains
Part-of-speech domains allow word clustering
[Figure: each word is given a domain of subdivided part-of-speech tags, e.g. chairman → {NN1, …, NN5}, of → {IN1, …, IN5}, Vinken → {NNP1, …, NNP5}]
Hidden–value domains
Supersense domains model the WordNet ontology (Ciaramita and Johnson, 2003; Miller et al., 1993)

[Figure: open-class words are given domains of subdivided WordNet supersenses, e.g. is → {verb.stative1–3, verb.social1–3, verb.possession1–3}, Vinken → {noun.person1–3}; other words keep subdivided POS domains]
Hidden–value domains
Hidden–value domains that didn’t work well
Word clustering without part-of-speech subdivisions
WordNet hyper/hyponym ontology
Domains containing mixed values
Examples of features
The highest nonterminal headed by chairman

[Figure: the parse tree with chairman's hidden value NN3 and the node NP(chairman) highlighted]

(NN3, Word=chairman, Highest Nonterminal=NP) ∈ φ(t_{i,j}, chairman, NN3)
Examples of features
The governing rule

[Figure: the parse tree with the hidden values VB1 (is) and NN3 (chairman) and the rule VP → VB NP highlighted]

(VB1, NN3, Rule=VP → VB NP) ∈ φ(t_{i,j}, is, chairman, VB1, NN3)
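A toy rendering of these two templates (values hardcoded to the running example; a real extractor would read the word, the highest nonterminal it heads, and the governing rule off the candidate parse):

def single_var_feature(word, hw, highest_nt):
    # (hidden value, word, highest nonterminal headed by the word)
    return {("HighestNT", hw, word, highest_nt): 1}

def pairwise_feature(hu, hv, rule):
    # (two hidden values in a dependency, rule where the dependency occurs)
    return {("Rule", hu, hv, rule): 1}

print(single_var_feature("chairman", "NN3", "NP"))
print(pairwise_feature("VB1", "NN3", "VP -> VB NP"))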
Overview of talk
Motivation
Design
Results
Discussion
Conclusion
Experimental Setup
N-best lists generated by the Collins parser, n_i ≈ 30
Training set: WSJ sections 2–21
Development set: WSJ section 0
Test set: WSJ sections 22–24
Final test models
Two baseline models
The Collins (1999) base parser
The Collins (2000) reranker
Two mixed models
MIX combines clustering, refinement, and WordNet
MIX+ augments MIX with features of Collins reranker
Results on Sections 22–24
Model              LR     LP     F1
Collins parser     88.19  88.60  88.39
MIX                89.41  89.87  89.64
Collins reranker   89.46  90.07  89.76
MIX+               89.78  90.29  90.03
All comparisons except MIX vs. Collins rerankerare significant at p ≤ 0.01 using the sign test
Overview of talk
Motivation
Design
Results
Discussion
Conclusion
Previous work
Parsing approaches that use hidden variables
Riezler et al. (2002)
Clark and Curran (2004)
Matsuzaki et al. (2005)
Differences with our approach
Use of reranking
Definition of hidden variables
Use of belief propagation
Using packed representations
Candidates t_{i,j} represented as a packed forest
Compact representation of many parse trees
Packed representation forces local scope
Features would become locally scoped on non-hidden information
Decoding becomes NP-hard; must approximate with Viterbi (cf. Matsuzaki et al., 2005):

\operatorname*{argmax}_{t_{i,j}} p(t_{i,j} \mid s_i, \Theta) \approx \operatorname*{argmax}_{t_{i,j},\, h} p(t_{i,j}, h \mid s_i, \Theta)
Empirical analysis of hidden values
The model makes hidden–value assignments on the basis of the reranking criterion,
i.e. maximize \sum_{i} \log p(t_{i,1} \mid s_i, \Theta)
The empirical distribution of assignments shows linguistically reasonable trends
Overview of talk
Motivation
Design
Results
Discussion
Conclusion
Concluding remarks
The hidden–variable model defines a new representation for NLP structures
Conditional log–linear model with hidden variables
BP enables efficient and exact training and decoding
Significant improvement over Collins (2000)
Example
I [will [give/VB2 an example]]
I expected [to [give/VB1 an example]]
I expected [to/TO4 [give an example]]
You expected [me [to/TO1,5 [give an example]]]