First-Order Probabilistic Models for Coreference Resolution
Aron Culotta
Computer Science Department
University of Massachusetts Amherst
Joint work with Andrew McCallum [advisor], Michael Wick, Robert Hall
Previous work: Conditional Random Fields for Coreference

A Pairwise Conditional Random Field for Coreference
[McCallum & Wellner, 2003, ICML] (PW-CRF)

[Figure: factor graph over three mentions, x1 ". . . Mr Powell . . .", x2 ". . . Powell . . .", x3 ". . . she . . .", with pairwise coreference variables y, e.g. Coreferent(x2, x3)?]
A Pairwise Conditional Random Field for Coreference
[McCallum & Wellner, 2003, ICML] (PW-CRF)

$$P(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{i,j} \sum_{l} \lambda_l f_l(x_i, x_j, y_{ij}) + \lambda' \sum_{i,j,k} f'(y_{ij}, y_{jk}, y_{ik}) \right)$$

[Figure: the same three mentions, now with learned pairwise compatibility scores 45, 30, and 11 on the edges]

Pairwise compatibility score learned from training data.
Hard transitivity constraints enforced by the prediction algorithm.
Prediction in PW-CRFs = Graph Partitioning
[Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

$$\log P(y \mid x) \propto \sum_{i,j} \sum_{l} \lambda_l f_l(x_i, x_j, y_{ij}) = \sum_{i,j \text{ within partitions}} w_{ij} \;-\; \sum_{i,j \text{ across partitions}} w_{ij}$$

[Figure: the three-mention example with edge scores 45, 30, and 11; the displayed partition scores 64]

Often approximated with agglomerative clustering.
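As a concrete illustration (mine, not the slides'), here is a minimal greedy agglomerative clusterer in Python; `pair_score` is an assumed callable standing in for the learned pairwise compatibility Σ_l λ_l f_l(x_i, x_j), and merging stops when no positive-scoring merge remains:

```python
def greedy_agglomerative(mentions, pair_score):
    """Approximate MAP partitioning: repeatedly merge the pair of
    clusters with the highest total pairwise compatibility."""
    clusters = [[m] for m in mentions]  # start from singleton clusters
    while len(clusters) > 1:
        best_score, best_pair = 0.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # total compatibility between clusters i and j
                score = sum(pair_score(a, b)
                            for a in clusters[i] for b in clusters[j])
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        if best_pair is None:  # no merge would raise the partition score
            break
        i, j = best_pair
        clusters[i].extend(clusters.pop(j))  # merge cluster j into i
    return clusters
```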
Parameter Estimation in PW-CRFs

• Given labeled documents, generate all pairs of mentions (see the sketch after this list)
  – Optionally prune distant mention pairs [Soon, Ng, Lim 2001]
• Learn a binary classifier to predict coreference
• Edge weights proportional to classifier output
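A minimal sketch of this training-data generation (my illustration; the `Mention` fields `position` and `entity_id`, and the distance cutoff, are assumptions rather than details from the slides):

```python
from itertools import combinations

def make_training_pairs(mentions, max_distance=10):
    """Generate binary classification examples from gold clusters,
    optionally pruning pairs that are far apart in the document."""
    examples = []
    for a, b in combinations(mentions, 2):
        if abs(a.position - b.position) > max_distance:
            continue  # prune distant mention pairs
        label = 1 if a.entity_id == b.entity_id else 0  # gold coreference
        examples.append((a, b, label))
    return examples
```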
Sometimes pairwise comparisons are insufficient

• Entities have multiple attributes (name, email, institution, location); we need to measure "compatibility" among them.
• Having 2 "given names" is common, but not 4.
  – e.g. Howard M. Dean / Martin, Dean / Howard Martin
• Need to measure the size of the clusters of mentions.
• Is there a pair of name strings whose edit distance is > 0.5?
• Maximum distance between mentions in the document
• Does an entity contain only pronoun mentions?

We need measures on hypothesized "entities" (illustrated below).
We need first-order logic.
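For concreteness, a minimal sketch of such entity-level measures (my own; `edit_distance` is an assumed normalized string-distance helper, and the mention fields are hypothetical):

```python
def entity_features(cluster):
    """First-order measures over a hypothesized entity (a set of mentions)."""
    names = [m.name for m in cluster if m.name is not None]
    return {
        "num-mentions": len(cluster),
        # Universal quantification: every mention is a pronoun
        "all-pronouns": all(m.is_pronoun for m in cluster),
        # Existential quantification: some pair of name strings is far apart
        "exists-distant-name-pair": any(
            edit_distance(a, b) > 0.5
            for i, a in enumerate(names) for b in names[i + 1:]),
        # Maximum distance between mentions in the document
        "max-mention-distance": max(
            (abs(a.position - b.position)
             for i, a in enumerate(cluster) for b in cluster[i + 1:]),
            default=0),
    }
```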
First-Order Logic CRFs for Coreference (FOL-CRF)

[Figure: the same three mentions x1 ". . . Mr Powell . . .", x2 ". . . Powell . . .", x3 ". . . she . . .", now with a single cluster-level variable y (score 56): Coreferent(x1, x2, x3)?]

$$P(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{X_i \in \mathcal{P}(x)} \sum_{l} \lambda_l f_l(X_i, y_i) \right)$$

Clusterwise compatibility score learned from training data.
Features are arbitrary FOL predicates over a set of mentions.

As in the PW-CRF, prediction can be approximated with agglomerative clustering.
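Under this model, the clusterwise compatibility used during agglomerative clustering is just a weighted sum of such features; a tiny sketch reusing the hypothetical `entity_features` above:

```python
def cluster_score(cluster, weights):
    """Clusterwise compatibility: sum_l lambda_l * f_l(cluster)."""
    return sum(weights.get(name, 0.0) * float(value)
               for name, value in entity_features(cluster).items())
```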
Learning Parameters of FOL-CRFs

• Generate classification examples where the input is a set of mentions
• Unlike the pairwise CRF, we cannot generate all possible examples from the training data

[Figure: six mentions, "He", "Powell", "Rice", "She", "he", "Secretary", with candidate examples Coreferent(x1,x2), Coreferent(x1,x2,x3), Coreferent(x1,x2,x3,x4), Coreferent(x1,x2,x3,x4,x5), Coreferent(x1,x2,x3,x4,x5,x6), . . .]

Combinatorial Explosion!
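A quick back-of-the-envelope count (my illustration) shows the scale: with n mentions there are 2^n − n − 1 candidate Coreferent(...) examples over two or more mentions, before even counting whole clusterings:

```python
def num_candidate_examples(n):
    """Subsets of n mentions of size >= 2, each a candidate
    Coreferent(...) training example."""
    return 2 ** n - n - 1

for n in (6, 20, 50):
    print(n, num_candidate_examples(n))
# 6 -> 57, 20 -> 1048555, 50 -> about 1.1e15
```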
Learning Parameters of FOL-CRFs

This space complexity is common in probabilistic first-order logic
[Gaifman 1964; Halpern 1990; Paskin 2002; Poole 2003; Richardson & Domingos 2006]
Training in Probabilistic FOL: Parameter estimation ("weight learning")

• Input
  – First-order formulae, e.g. ∀x S(x) ⇒ T(x)
  – Labeled data: constants a, b, c and observed facts S(a), T(a), S(b), T(b), S(c)
• Output
  – A weight for each formula:
    ∀x S(x) ⇒ T(x) [0.67]
    ∀x,y Coreferent(x,y) ⇒ Pronoun(x) [-2.3]
Training in Probabilistic FOL: Previous Work

• Maximum likelihood
  – Requires an intractable normalization constant
• Pseudo-likelihood [Richardson, Domingos 2006]
  – Ignores uncertainty of relational information
• E-M [Kersting, De Raedt 2001; Koller, Pfeffer 1997]
• Sampling [Paskin 2002]
• Perceptron [Singla, Domingos 2005]
  – Can be inefficient when prediction is expensive
• Piecewise training [Sutton, McCallum 2005]
  – Train "pieces" of the world in isolation
  – Performance is sensitive to which pieces are chosen

• Most methods require "unrolling" (grounding)
• Unrolling has exponential space complexity
  – E.g. for ∀x,y,z S(x,y,z) ⇒ T(x,y,z) over constants [a b c d e f g h], we must examine all triples
• Sampling can be inefficient due to the large sample space
• Proposal: let prediction errors guide sampling
Training in Probabilistic FOL: Parameter estimation ("weight learning")

Error-driven Training (see the sketch after this list)

• Input
  – Observed data O // input mentions
  – True labeling P // true clustering
  – Prediction algorithm A // clustering algorithm
  – Initial weights W, initial prediction Q // initial clustering
• Iterate until convergence
  – Q' ← A(Q, W, O) // merge clusters
  – If Q' introduces an error:
    • UpdateWeights(Q, Q', P, O, W)
  – Else Q ← Q'
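A minimal Python sketch of this loop, under my own assumptions about the interfaces: `predict_one_merge` plays the role of A and proposes a single cluster merge (or None when done), `introduces_error` checks the proposal against the true clustering P, and `update_weights` is the ranking update described on the next slide:

```python
def error_driven_train(mentions, true_clustering, weights,
                       predict_one_merge, introduces_error, update_weights,
                       max_steps=1000):
    """Error-driven training: step the clustering algorithm one merge at
    a time and update the weights only when a step introduces an error."""
    clustering = [[m] for m in mentions]  # initial prediction Q
    for _ in range(max_steps):
        proposal = predict_one_merge(clustering, weights, mentions)  # Q' = A(Q, W, O)
        if proposal is None:  # A proposes no further merge: converged
            break
        if introduces_error(proposal, true_clustering):
            # the step was a mistake: rank a better modification above it
            update_weights(clustering, proposal, true_clustering,
                           mentions, weights)
        else:
            clustering = proposal  # accept the error-free step: Q <- Q'
    return weights
```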
UpdateWeights(Q, Q', P, O, W): Learning to Rank Pairs of Predictions

• Using the truth P, generate a new Q'' that is a better modification of Q than Q'
• Update W s.t. Q'' = A(Q, W, O)
• Update parameters so that Q'' is ranked higher than Q'
Ranking vs. Classification Training

• Instead of training
  [Powell, Mr. Powell, he] → YES
  [Powell, Mr. Powell, she] → NO
• ...rather...
  [Powell, Mr. Powell, he] > [Powell, Mr. Powell, she]
• In general, the higher-ranked example may contain errors:
  [Powell, Mr. Powell, George, he] > [Powell, Mr. Powell, George, she]
Ranking Parameter Update

In our experiments, we use a large-margin update based on MIRA [Crammer, Singer 2003]:

$$W_{t+1} = \arg\min_{W} \|W_t - W\| \quad \text{s.t.} \quad \mathrm{Score}(Q'', W) - \mathrm{Score}(Q', W) \geq 1$$
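With a single ranking constraint, this quadratic program has a closed-form solution; here is a sketch under the assumption (mine, consistent with MIRA but not spelled out on the slide) that Score(Q, W) = W · φ(Q) for a feature vector φ:

```python
import numpy as np

def mira_update(W, phi_better, phi_worse, margin=1.0):
    """Single-constraint MIRA: the smallest change to W that scores
    phi_better above phi_worse by at least the margin."""
    diff = phi_better - phi_worse
    loss = margin - W.dot(diff)  # how badly the constraint is violated
    if loss <= 0 or not diff.any():
        return W  # constraint already satisfied (or no feature difference)
    tau = loss / diff.dot(diff)  # closed-form Lagrangian step size
    return W + tau * diff
```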
Advantages
• Never need to unroll the entire network
  – Only explore the partial solutions the prediction algorithm is likely to produce
• Weights are tuned for the prediction algorithm
• Adaptable to different prediction algorithms
  – beam search, simulated annealing, etc.
• Adaptable to different loss functions

Related:
• Incremental Perceptron [Collins, Roark 2004]
• LaSO [Daume, Marcu 2005]
Extended here for FOL, ranking, and a max-margin loss; we rank partial, possibly mistaken predictions.
Disadvantages
• Difficult to analyze exactly what global objective function is being optimized
• Convergence issues
  – Addressed by averaging the weight updates
Experiments

• ACE 2004 coreference
  – 443 newswire documents
• Standard feature set [Soon, Ng, Lim 2001; Ng & Cardie 2002]
  – Text match, gender, number, context, WordNet
• Additional first-order features
  – Min/Max/Average/Majority of pairwise features
    • E.g. average string edit distance, max document distance
  – Existential/universal quantifications of pairwise features
    • E.g. there exists gender disagreement
• Prediction: greedy agglomerative clustering
Experiments

B-Cubed F1 Score on ACE 2004 Noun Coreference:

          Sampling + Classification   Error-driven + Ranking
FOL-CRF            69.2                       79.3
PW-CRF             62.4                       72.5

Better representation (FOL-CRF vs. PW-CRF) and better training (error-driven + ranking vs. sampling + classification) each improve results.
[To our knowledge, the best previously reported result is ~69% (Ng, 2005).]
Conclusions
Combining logical and probabilistic approaches to AI can improve the state of the art in NLP.
Simple approximations can make these approaches practical for real-world problems.
Future Work

• Fancier features
  – Over entire clusterings
• Less greedy inference
  – Metropolis-Hastings sampling
• Analysis of training
  – Which positive/negative examples to select when updating
  – A loss function sensitive to local minima of prediction
• Analyze theoretical/empirical convergence
Thank you