First-Order Probabilistic Models for Coreference Resolution
Aron Culotta
Computer Science Department
University of Massachusetts Amherst
Joint work with Andrew McCallum [advisor], Michael Wick, Robert Hall
Previous work: Conditional Random Fields for Coreference

A Pairwise Conditional Random Field for Coreference
[McCallum & Wellner, 2003, ICML] (PW-CRF)

[Figure: factor graph over three mentions, x1 ". . . Mr Powell . . .", x2 ". . . Powell . . .", x3 ". . . she . . .", with pairwise coreference variables y, e.g. Coreferent(x2, x3)?]
A Pairwise Conditional Random Field for Coreference
[McCallum & Wellner, 2003, ICML] (PW-CRF)

$$P(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{i,j} \sum_{l} \lambda_l f_l(x_i, x_j, y_{ij}) + \lambda' \sum_{i,j,k} f'(y_{ij}, y_{jk}, y_{ik}) \right)$$

[Figure: the same three mentions, now with learned pairwise compatibility scores 45, 30, and 11 on the edges]

Pairwise compatibility score learned from training data.
Hard transitivity constraints enforced by the prediction algorithm.
Prediction in PW-CRFs = Graph Partitioning
[Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

$$\log P(y \mid x) \propto \sum_{i,j} \sum_{l} \lambda_l f_l(x_i, x_j, y_{ij}) = \sum_{i,j \text{ within partitions}} w_{ij} \;-\; \sum_{i,j \text{ across partitions}} w_{ij}$$

[Figure: the three-mention example with edge scores 45, 30, and 11; the displayed partition scores 64]

Often approximated with agglomerative clustering.
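As a concrete illustration (mine, not the slides'), here is a minimal greedy agglomerative clusterer in Python; `pair_score` is an assumed callable standing in for the learned pairwise compatibility Σ_l λ_l f_l(x_i, x_j), and merging stops when no positive-scoring merge remains:

```python
def greedy_agglomerative(mentions, pair_score):
    """Approximate MAP partitioning: repeatedly merge the pair of
    clusters with the highest total pairwise compatibility."""
    clusters = [[m] for m in mentions]  # start from singleton clusters
    while len(clusters) > 1:
        best_score, best_pair = 0.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # total compatibility between clusters i and j
                score = sum(pair_score(a, b)
                            for a in clusters[i] for b in clusters[j])
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        if best_pair is None:  # no merge would raise the partition score
            break
        i, j = best_pair
        clusters[i].extend(clusters.pop(j))  # merge cluster j into i
    return clusters
```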
Parameter Estimation in PW-CRFs

• Given labeled documents, generate all pairs of mentions (see the sketch after this list)
  – Optionally prune distant mention pairs [Soon, Ng, Lim 2001]
• Learn a binary classifier to predict coreference
• Edge weights proportional to classifier output
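A minimal sketch of this training-data generation (my illustration; the `Mention` fields `position` and `entity_id`, and the distance cutoff, are assumptions rather than details from the slides):

```python
from itertools import combinations

def make_training_pairs(mentions, max_distance=10):
    """Generate binary classification examples from gold clusters,
    optionally pruning pairs that are far apart in the document."""
    examples = []
    for a, b in combinations(mentions, 2):
        if abs(a.position - b.position) > max_distance:
            continue  # prune distant mention pairs
        label = 1 if a.entity_id == b.entity_id else 0  # gold coreference
        examples.append((a, b, label))
    return examples
```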
Sometimes pairwise comparisons are insufficient

• Entities have multiple attributes (name, email, institution, location); we need to measure "compatibility" among them.
• Having 2 "given names" is common, but not 4.
  – e.g. Howard M. Dean / Martin, Dean / Howard Martin
• Need to measure the size of the clusters of mentions.
• Is there a pair of name strings whose edit distance is > 0.5?
• Maximum distance between mentions in the document
• Does an entity contain only pronoun mentions?

We need measures on hypothesized "entities" (illustrated below).
We need first-order logic.
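For concreteness, a minimal sketch of such entity-level measures (my own; `edit_distance` is an assumed normalized string-distance helper, and the mention fields are hypothetical):

```python
def entity_features(cluster):
    """First-order measures over a hypothesized entity (a set of mentions)."""
    names = [m.name for m in cluster if m.name is not None]
    return {
        "num-mentions": len(cluster),
        # Universal quantification: every mention is a pronoun
        "all-pronouns": all(m.is_pronoun for m in cluster),
        # Existential quantification: some pair of name strings is far apart
        "exists-distant-name-pair": any(
            edit_distance(a, b) > 0.5
            for i, a in enumerate(names) for b in names[i + 1:]),
        # Maximum distance between mentions in the document
        "max-mention-distance": max(
            (abs(a.position - b.position)
             for i, a in enumerate(cluster) for b in cluster[i + 1:]),
            default=0),
    }
```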
First-Order Logic CRFs for Coreference (FOL-CRF)

[Figure: the same three mentions x1 ". . . Mr Powell . . .", x2 ". . . Powell . . .", x3 ". . . she . . .", now with a single cluster-level variable y (score 56): Coreferent(x1, x2, x3)?]

$$P(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{X_i \in \mathcal{P}(x)} \sum_{l} \lambda_l f_l(X_i, y_i) \right)$$

Clusterwise compatibility score learned from training data.
Features are arbitrary FOL predicates over a set of mentions.

As in the PW-CRF, prediction can be approximated with agglomerative clustering.
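Under this model, the clusterwise compatibility used during agglomerative clustering is just a weighted sum of such features; a tiny sketch reusing the hypothetical `entity_features` above:

```python
def cluster_score(cluster, weights):
    """Clusterwise compatibility: sum_l lambda_l * f_l(cluster)."""
    return sum(weights.get(name, 0.0) * float(value)
               for name, value in entity_features(cluster).items())
```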
Learning Parameters of FOL-CRFs

• Generate classification examples where the input is a set of mentions
• Unlike the pairwise CRF, we cannot generate all possible examples from the training data

[Figure: six mentions, "He", "Powell", "Rice", "She", "he", "Secretary", with candidate examples Coreferent(x1,x2), Coreferent(x1,x2,x3), Coreferent(x1,x2,x3,x4), Coreferent(x1,x2,x3,x4,x5), Coreferent(x1,x2,x3,x4,x5,x6), . . .]

Combinatorial Explosion!
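A quick back-of-the-envelope count (my illustration) shows the scale: with n mentions there are 2^n − n − 1 candidate Coreferent(...) examples over two or more mentions, before even counting whole clusterings:

```python
def num_candidate_examples(n):
    """Subsets of n mentions of size >= 2, each a candidate
    Coreferent(...) training example."""
    return 2 ** n - n - 1

for n in (6, 20, 50):
    print(n, num_candidate_examples(n))
# 6 -> 57, 20 -> 1048555, 50 -> about 1.1e15
```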
Learning Parameters of FOL-CRFs

This space complexity is common in probabilistic first-order logic
[Gaifman 1964; Halpern 1990; Paskin 2002; Poole 2003; Richardson & Domingos 2006]
Training in Probabilistic FOL: Parameter estimation ("weight learning")

• Input
  – First-order formulae, e.g. ∀x S(x) ⇒ T(x)
  – Labeled data: constants a, b, c and observed facts S(a), T(a), S(b), T(b), S(c)
• Output
  – A weight for each formula:
    ∀x S(x) ⇒ T(x) [0.67]
    ∀x,y Coreferent(x,y) ⇒ Pronoun(x) [-2.3]
Training in Probabilistic FOL: Previous Work

• Maximum likelihood
  – Requires an intractable normalization constant
• Pseudo-likelihood [Richardson, Domingos 2006]
  – Ignores uncertainty of relational information
• E-M [Kersting, De Raedt 2001; Koller, Pfeffer 1997]
• Sampling [Paskin 2002]
• Perceptron [Singla, Domingos 2005]
  – Can be inefficient when prediction is expensive
• Piecewise training [Sutton, McCallum 2005]
  – Train "pieces" of the world in isolation
  – Performance is sensitive to which pieces are chosen

• Most methods require "unrolling" (grounding)
• Unrolling has exponential space complexity
  – E.g. for ∀x,y,z S(x,y,z) ⇒ T(x,y,z) over constants [a b c d e f g h], we must examine all triples
• Sampling can be inefficient due to the large sample space
• Proposal: let prediction errors guide sampling
Training in Probabilistic FOL: Parameter estimation ("weight learning")

Error-driven Training (see the sketch after this list)

• Input
  – Observed data O // input mentions
  – True labeling P // true clustering
  – Prediction algorithm A // clustering algorithm
  – Initial weights W, initial prediction Q // initial clustering
• Iterate until convergence
  – Q' ← A(Q, W, O) // merge clusters
  – If Q' introduces an error:
    • UpdateWeights(Q, Q', P, O, W)
  – Else Q ← Q'
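A minimal Python sketch of this loop, under my own assumptions about the interfaces: `predict_one_merge` plays the role of A and proposes a single cluster merge (or None when done), `introduces_error` checks the proposal against the true clustering P, and `update_weights` is the ranking update described on the next slide:

```python
def error_driven_train(mentions, true_clustering, weights,
                       predict_one_merge, introduces_error, update_weights,
                       max_steps=1000):
    """Error-driven training: step the clustering algorithm one merge at
    a time and update the weights only when a step introduces an error."""
    clustering = [[m] for m in mentions]  # initial prediction Q
    for _ in range(max_steps):
        proposal = predict_one_merge(clustering, weights, mentions)  # Q' = A(Q, W, O)
        if proposal is None:  # A proposes no further merge: converged
            break
        if introduces_error(proposal, true_clustering):
            # the step was a mistake: rank a better modification above it
            update_weights(clustering, proposal, true_clustering,
                           mentions, weights)
        else:
            clustering = proposal  # accept the error-free step: Q <- Q'
    return weights
```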
UpdateWeights(Q, Q', P, O, W): Learning to Rank Pairs of Predictions

• Using the truth P, generate a new Q'' that is a better modification of Q than Q'
• Update W s.t. Q'' = A(Q, W, O)
• Update parameters so that Q'' is ranked higher than Q'
Ranking vs. Classification Training

• Instead of training
  [Powell, Mr. Powell, he] → YES
  [Powell, Mr. Powell, she] → NO
• ...rather...
  [Powell, Mr. Powell, he] > [Powell, Mr. Powell, she]
• In general, the higher-ranked example may contain errors:
  [Powell, Mr. Powell, George, he] > [Powell, Mr. Powell, George, she]
Ranking Parameter Update

In our experiments, we use a large-margin update based on MIRA [Crammer, Singer 2003]:

$$W_{t+1} = \arg\min_{W} \|W_t - W\| \quad \text{s.t.} \quad \mathrm{Score}(Q'', W) - \mathrm{Score}(Q', W) \geq 1$$
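With a single ranking constraint, this quadratic program has a closed-form solution; here is a sketch under the assumption (mine, consistent with MIRA but not spelled out on the slide) that Score(Q, W) = W · φ(Q) for a feature vector φ:

```python
import numpy as np

def mira_update(W, phi_better, phi_worse, margin=1.0):
    """Single-constraint MIRA: the smallest change to W that scores
    phi_better above phi_worse by at least the margin."""
    diff = phi_better - phi_worse
    loss = margin - W.dot(diff)  # how badly the constraint is violated
    if loss <= 0 or not diff.any():
        return W  # constraint already satisfied (or no feature difference)
    tau = loss / diff.dot(diff)  # closed-form Lagrangian step size
    return W + tau * diff
```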
Advantages
• Never need to unroll the entire network
  – Only explore the partial solutions the prediction algorithm is likely to produce
• Weights are tuned for the prediction algorithm
• Adaptable to different prediction algorithms
  – beam search, simulated annealing, etc.
• Adaptable to different loss functions

Related:
• Incremental Perceptron [Collins, Roark 2004]
• LaSO [Daume, Marcu 2005]
Extended here for FOL, ranking, and a max-margin loss; we rank partial, possibly mistaken predictions.
Disadvantages
• Difficult to analyze exactly what global objective function is being optimized
• Convergence issues
  – Addressed by averaging the weight updates
Experiments

• ACE 2004 coreference
  – 443 newswire documents
• Standard feature set [Soon, Ng, Lim 2001; Ng & Cardie 2002]
  – Text match, gender, number, context, WordNet
• Additional first-order features
  – Min/Max/Average/Majority of pairwise features
    • E.g. average string edit distance, max document distance
  – Existential/universal quantifications of pairwise features
    • E.g. there exists gender disagreement
• Prediction: greedy agglomerative clustering
Experiments

B-Cubed F1 Score on ACE 2004 Noun Coreference:

          Sampling + Classification   Error-driven + Ranking
FOL-CRF            69.2                       79.3
PW-CRF             62.4                       72.5

Better representation (FOL-CRF vs. PW-CRF) and better training (error-driven + ranking vs. sampling + classification) each improve results.
[To our knowledge, the best previously reported result is ~69% (Ng, 2005).]
Conclusions
Combining logical and probabilistic approaches to AI can improve the state of the art in NLP.
Simple approximations can make these approaches practical for real-world problems.
Future Work

• Fancier features
  – Over entire clusterings
• Less greedy inference
  – Metropolis-Hastings sampling
• Analysis of training
  – Which positive/negative examples to select when updating
  – A loss function sensitive to local minima of prediction
• Analyze theoretical/empirical convergence
Thank you