Hidden–Variable Models for Discriminative Reranking
Terry Koo and Michael Collins
{maestro|mcollins}@csail.mit.edu
Overview of reranking
The reranking approach:
Use a baseline model to get the N-best candidates
“Rerank” the candidates using a more complex model
Parse reranking:
Collins (2000): 88.2% ⇒ 89.8%
Charniak and Johnson (2005): 89.7% ⇒ 91.0%
Talk by Brooke Cowan in 7B: 83.6% ⇒ 85.1%
Also applied to:
MT (Och and Ney, 2002; Shen et al., 2004)
NL Generation (Walker et al., 2001)
Representing NLP structures
Proper representation is critical to success
Hand–crafted feature vector representations:
Φ(tree) = {0, 1, 2, 0, 0, 3, 0, 1}
Features defined through kernels:
K(tree1, tree2) = Φ(tree1) · Φ(tree2)
This talk: a new approach using hidden variables
Two facets of lexical items
Different lexical items can have similar meanings, e.g. president and chairman
Clustering: president, chairman ∈ Noun Cluster 4
A single lexical item can have different meanings, e.g. [river] bank vs. [financial] bank
Refinement: bank1, bank2 ∈ bank
Model clusterings and refinements as hidden variables that support the reranking task
Highlights of the approach
Conditional log–linear model with hidden variables
Dynamic programming is used for training and decoding
Clustering and refinement done automatically using a discriminative criterion
Overview of talk
Motivation
Design
General form of the model
Training and decoding efficiently
Creating specific instantiations
Results
Discussion
Conclusion
The parse reranking framework
Sentences s_i for 1 ≤ i ≤ n
s1: Pierre Vinken , 61 years old , will join ...
s2: Mr. Vinken is chairman of Elsevier N.V. ...
s3: Big Board Chairman John Phelan said yesterday ...
Each s_i has candidate parses t_{i,j} for 1 ≤ j ≤ n_i
t_{i,1} is the best candidate parse for s_i
The parse reranking framework
t_{i,j} has phrase structure and dependency tree

[Figure: parse tree for “Mr. Vinken is chairman of Elsevier N.V.” with POS tags NNP NNP VB NN IN NNP NNP and nonterminals NP, PP, VP, S, plus the corresponding head–modifier dependencies]
Adding hidden variables
Hidden–value domains H_w(t_{i,j}) for 1 ≤ w ≤ len(s_i)

[Figure: the parse tree with a candidate hidden-value domain attached to each word, e.g. chairman → {NN1, NN2, NN3}, is → {VB1, VB2, VB3}, of → {IN1, IN2, IN3}]
Adding hidden variables
Assignment h ∈ H_1(t_{i,j}) × … × H_{len(s_i)}(t_{i,j})

[Figure: the parse tree with one hidden value selected from each word's domain]
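To make the assignment space concrete, here is a minimal Python sketch (illustrative only, not the authors' code) that enumerates H_1 × … × H_{len(s_i)} for the running example; even this short sentence already has 3^7 = 2187 assignments.

from itertools import product

# Hypothetical per-word hidden-value domains for
# "Mr. Vinken is chairman of Elsevier N.V.", three values each,
# mirroring the subdivided POS tags in the figure.
domains = {
    "Mr.":      ["NNP1", "NNP2", "NNP3"],
    "Vinken":   ["NNP1", "NNP2", "NNP3"],
    "is":       ["VB1", "VB2", "VB3"],
    "chairman": ["NN1", "NN2", "NN3"],
    "of":       ["IN1", "IN2", "IN3"],
    "Elsevier": ["NNP1", "NNP2", "NNP3"],
    "N.V.":     ["NNP1", "NNP2", "NNP3"],
}

words = list(domains)
assignments = product(*(domains[w] for w in words))
h = dict(zip(words, next(assignments)))   # one assignment h
print(h)
print("total assignments:", 3 ** len(words))  # 2187 for 7 words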
Marginalized probability model
Φ(t_{i,j}, h) produces a descriptive vector of feature occurrence counts, e.g.
Φ_2(t_{i,j}, h) = Count(chairman has hidden value NN1)
Φ_13(t_{i,j}, h) = Count(NNP2 is a direct object of VB1)
Φ_19(t_{i,j}, h) = Count(NN1 coordinates with NN2)
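One natural representation for such a count vector is a sparse map from feature names to counts; the sketch below (feature names and weights are made up, not the paper's actual feature set) scores one (t_{i,j}, h) pair as Φ · Θ.

from collections import Counter

# Sparse feature-occurrence counts for one (tree, assignment) pair;
# the names paraphrase the slide's Phi_2, Phi_13, Phi_19.
phi = Counter({
    "Word=chairman,Hidden=NN1": 1,
    "DirObj(VB1 -> NNP2)": 1,
    "Coord(NN1, NN2)": 2,
})

# Parameter vector Theta, also sparse (weights are illustrative).
theta = {"Word=chairman,Hidden=NN1": 0.7, "DirObj(VB1 -> NNP2)": -0.2}

score = sum(theta.get(f, 0.0) * count for f, count in phi.items())
print(score)  # Phi(t, h) . Theta = 0.5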
Marginalized probability model
Log–linear distribution over (t_{i,j}, h) with parameters Θ:

p(t_{i,j}, h \mid s_i, \Theta) = \frac{e^{\Phi(t_{i,j}, h) \cdot \Theta}}{\sum_{j', h'} e^{\Phi(t_{i,j'}, h') \cdot \Theta}}

Marginalize over assignments h:

p(t_{i,j} \mid s_i, \Theta) = \sum_{h} p(t_{i,j}, h \mid s_i, \Theta)
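A brute-force rendering of these two equations, as a sketch: the hypothetical `features(t, h)` callback stands in for the paper's feature extraction and is assumed to return sparse count dicts.

import math
from itertools import product

def dot(theta, phi):
    """Inner product Phi(t, h) . Theta over sparse count dicts."""
    return sum(theta.get(f, 0.0) * c for f, c in phi.items())

def rerank_probs(theta, candidates, domains, features):
    """p(t_j | s, Theta) by explicit marginalization over h.

    Exponential in sentence length -- exact but only viable for toy
    inputs; the paper's BP machinery computes the same sums efficiently.
    """
    weight = {}
    for j, t in enumerate(candidates):
        for h in product(*domains):
            weight[(j, h)] = math.exp(dot(theta, features(t, h)))
    z = sum(weight.values())           # normalizer over all (j, h)
    probs = [0.0] * len(candidates)
    for (j, h), w in weight.items():
        probs[j] += w / z              # sum_h p(t_j, h | s, Theta)
    return probs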
Optimizing the parameters
Define loss as negative log-likelihood
L(\Theta) = -\sum_{i=1}^{n} \log p(t_{i,1} \mid s_i, \Theta)
Minimize L(Θ) through gradient descent
\frac{\partial L}{\partial \Theta} = -\sum_{i} \sum_{h} p(h \mid t_{i,1}, s_i, \Theta)\, \Phi(t_{i,1}, h) + \sum_{i,j} p(t_{i,j} \mid s_i, \Theta) \sum_{h} p(h \mid t_{i,j}, s_i, \Theta)\, \Phi(t_{i,j}, h)
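The gradient is a difference of two feature expectations: Φ averaged under the joint p(t_{i,j}, h | s_i, Θ), minus Φ averaged under p(h | t_{i,1}, s_i, Θ) for the target parse. A sketch by enumeration, with the same hypothetical `features`/`domains` conventions as above and candidates[0] playing the role of t_{i,1}:

import math
from itertools import product
from collections import Counter

def dot(theta, phi):
    return sum(theta.get(f, 0.0) * c for f, c in phi.items())

def grad_one_sentence(theta, candidates, domains, features):
    """dL/dTheta for one sentence: E[Phi | s] - E[Phi | t_1, s]."""
    weight = {(j, h): math.exp(dot(theta, features(t, h)))
              for j, t in enumerate(candidates)
              for h in product(*domains)}
    z_all = sum(weight.values())
    z_tgt = sum(w for (j, _), w in weight.items() if j == 0)

    grad = Counter()
    for (j, h), w in weight.items():
        for f, c in features(candidates[j], h).items():
            grad[f] += (w / z_all) * c      # + expectation under p(t,h|s)
            if j == 0:                      # candidates[0] is t_{i,1}
                grad[f] -= (w / z_tgt) * c  # - expectation under p(h|t_1,s)
    return grad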
Overview of talk
Motivation
Design
General form of the model
Training and decoding efficiently
Creating specific instantiations
Results
Discussion
Conclusion
Problems with efficiency
|H_1(t_{i,j}) × … × H_{len(s_i)}(t_{i,j})| grows exponentially, so training the model is intractable:

\frac{\partial L}{\partial \Theta} = -\sum_{i} \sum_{h} p(h \mid t_{i,1}, s_i, \Theta)\, \Phi(t_{i,1}, h) + \sum_{i,j} p(t_{i,j} \mid s_i, \Theta) \sum_{h} p(h \mid t_{i,j}, s_i, \Theta)\, \Phi(t_{i,j}, h)

Decoding the model is also intractable:

p(t_{i,j} \mid s_i, \Theta) = \sum_{h} p(t_{i,j}, h \mid s_i, \Theta)
Locality constraint on features
Features have pairwise local scope on hidden variables
Features still have global scope on non-hidden information
Φ can be factored into local feature vectors, allowing dynamic programming
Local feature vectors
Define two kinds of local feature vector φ:
Single-variable φ(t_{i,j}, w, h_w) looks at a single hidden variable
Pairwise φ(t_{i,j}, u, v, h_u, h_v) looks at two hidden variables in a dependency relationship
Local feature vectors
Φ(t_{i,j}, h) looks at every hidden variable

[Figure: the parse tree with every word's hidden value highlighted]
Local feature vectors
φ(t_{i,j}, chairman, NN3) only sees NN3

[Figure: the parse tree with only chairman's hidden value NN3 highlighted]
Local feature vectors
φ(t_{i,j}, chairman, of, NN3, IN2) sees NN3 and IN2

[Figure: the parse tree with the hidden values NN3 (chairman) and IN2 (of) highlighted]
Local feature vectors
Rewrite global Φ as a sum over local φ
\Phi(t_{i,j}, h) = \sum_{w \in t_{i,j}} \phi(t_{i,j}, w, h_w) + \sum_{(u,v) \in D(t_{i,j})} \phi(t_{i,j}, u, v, h_u, h_v)
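In code, the factorization says the global vector is just an accumulation of local ones. A sketch (the local extractors `phi_word` and `phi_dep` are hypothetical placeholders for the paper's feature templates):

from collections import Counter

def global_phi(tree, h, positions, deps, phi_word, phi_dep):
    """Phi(t, h) as a sum of local feature vectors.

    positions : hidden-variable positions w in the tree
    deps      : head-modifier pairs (u, v) in D(t)
    phi_word  : phi_word(tree, w, h[w])          -> count dict
    phi_dep   : phi_dep(tree, u, v, h[u], h[v])  -> count dict
    """
    phi = Counter()
    for w in positions:
        phi.update(phi_word(tree, w, h[w]))          # single-variable terms
    for u, v in deps:
        phi.update(phi_dep(tree, u, v, h[u], h[v]))  # pairwise terms
    return phi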
Applying belief propagation
New restrictions enable dynamic–programming approaches, e.g. belief propagation
BP generalizes the forward–backward algorithm from a chain to a tree
Runtime O(len(s_i) · H²), where H = max |H_w(t_{i,j})|
BP efficiently computes

\sum_{h} p(t_{i,j}, h \mid s_i, \Theta) \quad\text{and}\quad \sum_{h} p(h \mid t_{i,j}, s_i, \Theta)\, \Phi(t_{i,j}, h)
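For intuition, here is a generic sum-product sketch on a tree (not the authors' implementation). With unary potentials ψ_w(h_w) = e^{φ(t,w,h_w)·Θ} on words and pairwise potentials ψ_uv(h_u,h_v) = e^{φ(t,u,v,h_u,h_v)·Θ} on dependency edges, the shared normalizer of the beliefs equals Σ_h e^{Φ(t,h)·Θ}, and the normalized beliefs give the per-variable posteriors needed for the expected feature counts.

from collections import defaultdict
from math import prod

def sum_product(nodes, edges, unary, pairwise):
    """Exact marginals on a tree via belief propagation (sketch).

    nodes    : hidden-variable ids, e.g. word positions
    edges    : (u, v) tree edges, e.g. dependency links
    unary    : unary[w][a] = psi_w(a)
    pairwise : pairwise[(u, v)][a][b] = psi_uv(a, b)
    Returns unnormalized beliefs b[w][a]; each belief sums to the
    partition function Z, and b[w][a] / Z = p(h_w = a).
    Runtime O(len(s) * H^2), matching the slide.
    """
    nbrs = defaultdict(list)
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)

    def psi(u, v, a, b):  # edge potentials are stored once per edge
        return pairwise[(u, v)][a][b] if (u, v) in pairwise \
            else pairwise[(v, u)][b][a]

    # Root the tree and record a parent-before-child visit order.
    root = nodes[0]
    parent, order, stack = {root: None}, [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in nbrs[u]:
            if v not in parent:
                parent[v] = u
                stack.append(v)

    msg = {}  # msg[(u, v)][b] = message from u to v

    def send(u, v):
        msg[(u, v)] = {
            b: sum(unary[u][a] * psi(u, v, a, b) *
                   prod(msg[(w, u)][a] for w in nbrs[u] if w != v)
                   for a in unary[u])
            for b in unary[v]
        }

    for u in reversed(order):        # upward pass: leaves to root
        if parent[u] is not None:
            send(u, parent[u])
    for u in order:                  # downward pass: root to leaves
        for v in nbrs[u]:
            if v != parent[u]:
                send(u, v)

    return {u: {a: unary[u][a] * prod(msg[(w, u)][a] for w in nbrs[u])
                for a in unary[u]}
            for u in nodes}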
Overview of talk
Motivation
Design
General form of the model
Training and decoding efficiently
Creating specific instantiations
Results
Discussion
Conclusion
Two areas for choice in the model
Definition of the hidden–value domains H_w(t_{i,j})
Definition of the feature vectors φ
Hidden–value domains
Lexical domains allow word refinement
[Figure: each word of “Mr. Vinken is chairman of Elsevier N.V.” is given a domain of refined copies of itself, e.g. chairman → {chairman1, chairman2, chairman3}, of → {of1, of2, of3}]
Hidden–value domains
Part-of-speech domains allow word clustering
[Figure: each word is given a domain of subdivided part-of-speech tags, e.g. chairman → {NN1, …, NN5}, of → {IN1, …, IN5}, Vinken → {NNP1, …, NNP5}]
Hidden–value domains
Supersense domains model the WordNet ontology (Ciaramita and Johnson, 2003; Miller et al., 1993)

[Figure: open-class words are given domains of subdivided WordNet supersenses, e.g. is → {verb.stative1–3, verb.social1–3, verb.possession1–3}, Vinken → {noun.person1–3}; other words keep subdivided POS domains]
Hidden–value domains
Hidden–value domains that didn’t work well
Word clustering without part-of-speech subdivisions
WordNet hyper/hyponym ontology
Domains containing mixed values
Examples of features
The highest nonterminal headed by chairman

[Figure: the parse tree with chairman's hidden value NN3 and the node NP(chairman) highlighted]

(NN3, Word=chairman, Highest Nonterminal=NP) ∈ φ(t_{i,j}, chairman, NN3)
Examples of features
The governing rule

[Figure: the parse tree with the hidden values VB1 (is) and NN3 (chairman) and the rule VP → VB NP highlighted]

(VB1, NN3, Rule=VP → VB NP) ∈ φ(t_{i,j}, is, chairman, VB1, NN3)
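A toy rendering of these two templates (values hardcoded to the running example; a real extractor would read the word, the highest nonterminal it heads, and the governing rule off the candidate parse):

def single_var_feature(word, hw, highest_nt):
    # (hidden value, word, highest nonterminal headed by the word)
    return {("HighestNT", hw, word, highest_nt): 1}

def pairwise_feature(hu, hv, rule):
    # (two hidden values in a dependency, rule where the dependency occurs)
    return {("Rule", hu, hv, rule): 1}

print(single_var_feature("chairman", "NN3", "NP"))
print(pairwise_feature("VB1", "NN3", "VP -> VB NP"))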
Overview of talk
Motivation
Design
Results
Discussion
Conclusion
Experimental Setup
N-best lists generated by the Collins parser, n_i ≈ 30
Training set: WSJ sections 2–21
Development set: WSJ section 0
Test set: WSJ sections 22–24
Final test models
Two baseline models
The Collins (1999) base parser
The Collins (2000) reranker
Two mixed models
MIX combines clustering, refinement, and WordNet
MIX+ augments MIX with features of Collins reranker
Results on Sections 22–24
Model              LR     LP     F1
Collins parser     88.19  88.60  88.39
MIX                89.41  89.87  89.64
Collins reranker   89.46  90.07  89.76
MIX+               89.78  90.29  90.03
All comparisons except MIX vs. Collins rerankerare significant at p ≤ 0.01 using the sign test
Overview of talk
Motivation
Design
Results
Discussion
Conclusion
Previous work
Parsing approaches that use hidden variables
Riezler et al. (2002)
Clark and Curran (2004)
Matsuzaki et al. (2005)
Differences with our approach
Use of reranking
Definition of hidden variables
Use of belief propagation
Using packed representations
Candidates t_{i,j} represented as a packed forest
Compact representation of many parse trees
Packed representation forces local scope
Features would become locally scoped on non-hidden information
Decoding becomes NP-hard; must approximate with Viterbi (cf. Matsuzaki et al., 2005):

\operatorname*{argmax}_{t_{i,j}} p(t_{i,j} \mid s_i, \Theta) \approx \operatorname*{argmax}_{t_{i,j},\, h} p(t_{i,j}, h \mid s_i, \Theta)
Empirical analysis of hidden values
The model makes hidden–value assignments on the basis of the reranking criterion,
i.e. maximize \sum_{i} \log p(t_{i,1} \mid s_i, \Theta)
The empirical distribution of assignments shows linguistically reasonable trends
Overview of talk
Motivation
Design
Results
Discussion
Conclusion
Concluding remarks
The hidden–variable model defines a new representation for NLP structures
Conditional log–linear model with hidden variables
BP enables efficient and exact training and decoding
Significant improvement over Collins (2000)
Example
I [will [give/VB2 an example]]
I expected [to [give/VB1 an example]]
I expected [to/TO4 [give an example]]
You expected [me [to/TO1,5 [give an example]]]