
Page 1:

A Conditional Random Field for Discriminatively-trained

Finite-state String Edit Distance

Andrew McCallum

Kedar Bellare

Fernando Pereira

Thanks to Charles Sutton, Xuerui Wang and Mikhail Bilenko for helpful discussions.


Page 2:

String Edit Distance

• Distance between sequences x and y:
  – "cost" of lowest-cost sequence of edit operations that transform string x into y.

Page 3:

String Edit Distance

• Distance between sequences x and y:
  – "cost" of lowest-cost sequence of edit operations that transform string x into y.

• Applications
  – Database Record Deduplication

    Apex International Hotel Grassmarket Street
    Apex Internat’l Grasmarket Street

    Are these records duplicates of the same hotel?

Page 4:

String Edit Distance

• Distance between sequences x and y:
  – "cost" of lowest-cost sequence of edit operations that transform string x into y.

• Applications
  – Database Record Deduplication
  – Biological Sequences

    AGCTCTTACGATAGAGGACTCCAGA
    AGGTCTTACCAAAGAGGACTTCAGA

Page 5:

String Edit Distance

• Distance between sequences x and y:
  – "cost" of lowest-cost sequence of edit operations that transform string x into y.

• Applications
  – Database Record Deduplication
  – Biological Sequences
  – Machine Translation

    Il a acheté une pomme
    He bought an apple

Page 6:

String Edit Distance

• Distance between sequences x and y:
  – "cost" of lowest-cost sequence of edit operations that transform string x into y.

• Applications
  – Database Record Deduplication
  – Biological Sequences
  – Machine Translation
  – Textual Entailment

    He bought a new car last night
    He purchased a brand new automobile yesterday evening

Page 7:

Levenshtein Distance

Edit operations:
  copy    Copy a character from x to y            (cost 0)
  insert  Insert a character into y               (cost 1)
  delete  Delete a character from y               (cost 1)
  subst   Substitute one character for another    (cost 1)

Align two strings:
  x1 = William W. Cohon
  x2 = Willleam Cohen

Lowest-cost alignment (operation and cost at each position):

  x1:    W    i    l    l    -    i    a    m    _    W    .    _    C    o    h    o    n
  x2:    W    i    l    l    l    e    a    m    -    -    -    _    C    o    h    e    n
  op:    copy copy copy copy ins  sub  copy copy del  del  del  copy copy copy copy sub  copy
  cost:  0    0    0    0    1    1    0    0    1    1    1    0    0    0    0    1    0

  ("-" marks a position where that string contributes no character; ins = insert, del = delete, sub = subst)

Total cost = 6 = Levenshtein Distance

[Levenshtein 1966]

Page 8:

Levenshtein Distance

Edit operations:
  copy    Copy a character from x to y            (cost 0)
  insert  Insert a character into y               (cost 1)
  delete  Delete a character from y               (cost 1)
  subst   Substitute one character for another    (cost 1)

Dynamic program:

  D(i,j) = score of the best alignment of x1...xi with y1...yj

  D(i,j) = min {  D(i-1,j-1) + (xi ≠ yj),    (copy / subst)
                  D(i-1,j)   + 1,            (delete)
                  D(i,j-1)   + 1  }          (insert)

DP table for "William" vs. "Willleam"; the bottom-right entry is the total cost = distance:

         W  i  l  l  l  e  a  m
      0  1  2  3  4  5  6  7  8
   W  1  0  1  2  3  4  5  6  7
   i  2  1  0  1  2  3  4  5  6
   l  3  2  1  0  1  2  3  4  5
   l  4  3  2  1  0  1  2  3  4
   i  5  4  3  2  1  1  2  3  4
   a  6  5  4  3  2  2  2  2  4
   m  7  6  5  4  3  3  3  3  2
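The recurrence translates directly into code. Below is a minimal Python sketch of the quadratic dynamic program (not part of the talk; function and variable names are my own):

```python
def levenshtein(x, y):
    """Unit-cost edit distance: copy costs 0; insert, delete, subst cost 1."""
    n, m = len(x), len(y)
    # D[i][j] = cost of the best alignment of x[:i] with y[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i              # i deletions
    for j in range(m + 1):
        D[0][j] = j              # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (x[i - 1] != y[j - 1]),  # copy / subst
                D[i - 1][j] + 1,                           # delete
                D[i][j - 1] + 1,                           # insert
            )
    return D[n][m]

print(levenshtein("William W. Cohon", "Willleam Cohen"))   # 6, matching the alignment above
```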

Page 9:

Levenshtein Distance with Markov Dependencies

Edit operations, with cost conditioned on the previous operation:

                                                  Cost after
                                                  copy  insert  delete  subst
  copy    Copy a character from x to y              0      0       0      0
  insert  Insert a character into y                 1     1/2      1      1
  delete  Delete a character from y                 1      1      1/2     1
  subst   Substitute one character for another      1      1       1      1

  (repeated insert or repeated delete is cheaper)

Learn these costs from training data.

The dynamic programming table becomes a 3D DP table: the same grid as before, with one layer per previous edit operation.

  [FSM diagram: states copy, insert, delete, subst]
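To make "cost conditioned on the previous operation" concrete, here is a hedged Python sketch of the 3D dynamic program. The function names, the cost-table interface, and the handling of the first operation (start_cost) are my own assumptions, not the talk's:

```python
INF = float("inf")
OPS = ("copy", "insert", "delete", "subst")

def markov_edit_distance(x, y, cost, start_cost):
    """cost[prev][op] = cost of operation `op` when the previous operation was
    `prev`; start_cost[op] = cost of `op` as the first operation (an assumption,
    the slide does not specify it).  Returns the cheapest cost of turning x into y."""
    n, m = len(x), len(y)
    # D[i][j][op] = best cost of an alignment of x[:i] with y[:j] whose last
    #               operation is `op` (the third DP dimension)
    D = [[dict.fromkeys(OPS, INF) for _ in range(m + 1)] for _ in range(n + 1)]

    def relax(i, j, op, c):
        if c < D[i][j][op]:
            D[i][j][op] = c

    for i in range(n + 1):
        for j in range(m + 1):
            # cheapest way to reach cell (i, j) and then perform each operation
            if i == 0 and j == 0:
                leave = dict(start_cost)
            else:
                leave = {op: min(D[i][j][p] + cost[p][op] for p in OPS) for op in OPS}
            if i < n and j < m:
                # simplification: copy when the characters agree, otherwise substitute
                step = "copy" if x[i] == y[j] else "subst"
                relax(i + 1, j + 1, step, leave[step])
            if j < m:
                relax(i, j + 1, "insert", leave["insert"])
            if i < n:
                relax(i + 1, j, "delete", leave["delete"])
    return min(D[n][m].values())

# Unit costs, except that repeating an insert or a delete is cheaper,
# following the slide's "repeated delete is cheaper" annotation.
base = {"copy": 0.0, "insert": 1.0, "delete": 1.0, "subst": 1.0}
cost = {p: dict(base) for p in OPS}
cost["insert"]["insert"] = 0.5
cost["delete"]["delete"] = 0.5
print(markov_edit_distance("William W. Cohon", "Willleam Cohen", cost, dict(base)))
# lower than the unit-cost distance of 6, because the run of deletes is now cheaper
```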

Page 10:

Ristad & Yianilos (1997): essentially a Pair-HMM, generating an edit/state/alignment sequence and two strings.

  p(a, x1, x2) = ∏_t p(a_t | a_{t-1}) · p(x1[a_t.i1], x2[a_t.i2] | a_t)      (complete data likelihood)

Learn via EM:
  Expectation step: calculate the likelihood of alignment paths.
  Maximization step: make those paths more likely.

  x1 (string 1):  W i l l i a m _ W . _ C o h o n
  x2 (string 2):  W i l l l e a m _ C o h e n

  alignment a:
    a.e:   copy copy copy copy ins  sub  copy copy del  del  del  copy copy copy copy sub  copy
    a.i1:  1    2    3    4    4    5    6    7    8    9    10   11   12   13   14   15   16
    a.i2:  1    2    3    4    5    6    7    8    8    8    8    9    10   11   12   13   14

    (ins = insert, del = delete, sub = subst)

  p(x1, x2) = Σ_{a : x1,x2} ∏_t p(a_t | a_{t-1}) · p(x1[a_t.i1], x2[a_t.i2] | a_t)

  incomplete data likelihood (sum over all alignments consistent with x1 and x2)

Match score = p(x1, x2)

Given a training set of matching string pairs, the objective function is

  O = ∏_j p(x1^(j), x2^(j))

Page 11:

Ristad & Yianilos Regrets

• Limited features of input strings
  – Examine only a single character pair at a time
  – Difficult to use upcoming string context, lexicons, ...
  – Example: “Senator John Green” vs. “John Green”

• Limited edit operations
  – Difficult to generate arbitrary jumps in both strings
  – Example: “UMass” vs. “University of Massachusetts”

• Trained only on positive match data
  – Doesn’t include information-rich “near misses”
  – Example: “ACM SIGIR” ≠ “ACM SIGCHI”

So, consider a model trained by conditional probability.

Page 12:

Conditional Probability (Sequence) Models

• We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(y|x) instead of P(y,x).

– Can examine features, but not responsible for generating them.

– Don’t have to explicitly model their dependencies.

Page 13:

From HMMs to Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]

Joint (HMM), over a label sequence y = y1, y2, ..., yn and an observation sequence x = x1, x2, ..., xn:

  P(y, x) = ∏_{t=1}^{|x|} P(y_t | y_{t-1}) · P(x_t | y_t)

Conditional:

  P(y | x) = (1 / P(x)) ∏_{t=1}^{|x|} P(y_t | y_{t-1}) · P(x_t | y_t)

           = (1 / Z(x)) ∏_{t=1}^{|x|} Φ_s(y_t, y_{t-1}) · Φ_o(x_t, y_t)

  where  Φ_o(x_t, y_t) = exp( Σ_k λ_k f_k(y_t, x_t) )

(A super-special case of Conditional Random Fields: the linear chain.)

Set parameters by maximum likelihood, using an optimization method on L.

  [Figure: chain-structured graphical models over ... y_{t-1}, y_t, y_{t+1} ... and ... x_{t-1}, x_t, x_{t+1} ..., drawn for both the joint and the conditional model]

Wide-spread interest, positive experimental results in many applications:
  Noun phrase, Named entity [HLT’03], [CoNLL’03]
  Protein structure prediction [ICML’04]
  IE from Bioinformatics text [Bioinformatics ’04], ...
  Asian word segmentation [COLING’04], [ACL’04]
  IE from Research papers [HLT’04]
  Object classification in images [CVPR ’04]
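For concreteness, the observation potential is just an exponentiated weighted sum of feature values; a tiny Python sketch with hypothetical feature and weight names:

```python
import math

def phi_o(lambdas, feature_fns, y_t, x_t):
    """Observation potential of the linear-chain CRF above:
    Phi_o(x_t, y_t) = exp( sum_k lambda_k * f_k(y_t, x_t) )."""
    return math.exp(sum(lam * f(y_t, x_t) for lam, f in zip(lambdas, feature_fns)))

# Hypothetical binary feature: current label is "NAME" and the word is capitalized.
cap_and_name = lambda y_t, x_t: float(y_t == "NAME" and x_t[:1].isupper())
print(phi_o([1.5], [cap_and_name], "NAME", "McCallum"))   # exp(1.5) ≈ 4.48
```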

Page 14:

CRF String Edit Distance

  x1 (string 1):  W i l l i a m _ W . _ C o h o n
  x2 (string 2):  W i l l l e a m _ C o h e n

  alignment a:
    a.e:   copy copy copy copy ins  sub  copy copy del  del  del  copy copy copy copy sub  copy
    a.i1:  1    2    3    4    4    5    6    7    8    9    10   11   12   13   14   15   16
    a.i2:  1    2    3    4    5    6    7    8    8    8    8    9    10   11   12   13   14

Joint complete data likelihood (Ristad & Yianilos):

  p(a, x1, x2) = ∏_t p(a_t | a_{t-1}) · p(x1[a_t.i1], x2[a_t.i2] | a_t)

Conditional complete data likelihood (CRF):

  p(a | x1, x2) = (1 / Z_{x1,x2}) ∏_t Φ(a_t, a_{t-1}, x1, x2)

Want to train from a set of string pairs, each labeled one of {match, non-match}:

  match      “William W. Cohon”   “Willlleam Cohen”
  non-match  “Bruce D’Ambrosio”   “Bruce Croft”
  match      “Tommi Jaakkola”     “Tommi Jakola”
  match      “Stuart Russell”     “Stuart Russel”
  non-match  “Tom Deitterich”     “Tom Dean”

Page 15:

CRF String Edit Distance FSM

  [FSM diagram: a single edit-operation FSM with states copy, insert, delete, subst]

Page 16:

CRF String Edit Distance FSM

  [FSM diagram: a Start state leading into two copies of the edit-operation FSM (copy, insert, delete, subst): a match sub-model (m = 1) and a non-match sub-model (m = 0)]

Conditional incomplete data likelihood:

  p(m | x1, x2) = (1 / Z_{x1,x2}) Σ_{a ∈ S_m} ∏_t Φ(a_t, a_{t-1}, x1, x2)

  (S_m = the set of alignments that pass through the sub-model for label m)
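A hedged Python sketch of how p(m | x1, x2) can be computed: run a forward dynamic program over alignments once for the match sub-model and once for the non-match sub-model, using unnormalized potentials Φ, then normalize the two totals (so Z_{x1,x2} is their sum). The potential interface and names are my assumptions, and a practical version would work in log space:

```python
def submodel_score(x1, x2, phi, states):
    """Sum of prod_t Phi(a_t, a_{t-1}, x1, x2) over all alignments that stay
    inside one sub-model.  phi(prev, s, i, j, x1, x2) is an unnormalized
    potential that may inspect the strings around positions (i, j);
    prev is None for the first operation."""
    n, m = len(x1), len(x2)
    alpha = [[dict.fromkeys(states, 0.0) for _ in range(m + 1)] for _ in range(n + 1)]

    def into(i, j, s):
        if i == 0 and j == 0:
            return phi(None, s, i, j, x1, x2)
        return sum(alpha[i][j][p] * phi(p, s, i, j, x1, x2) for p in states)

    for i in range(n + 1):
        for j in range(m + 1):
            if i < n and j < m:
                s = "copy" if x1[i] == x2[j] else "subst"
                alpha[i + 1][j + 1][s] += into(i, j, s)
            if j < m:
                alpha[i][j + 1]["insert"] += into(i, j, "insert")
            if i < n:
                alpha[i + 1][j]["delete"] += into(i, j, "delete")
    return sum(alpha[n][m].values())

def p_match(x1, x2, phi_match, phi_nonmatch,
            states=("copy", "insert", "delete", "subst")):
    """p(m = 1 | x1, x2): the match sub-model's share of the total score."""
    z_match = submodel_score(x1, x2, phi_match, states)
    z_nonmatch = submodel_score(x1, x2, phi_nonmatch, states)
    return z_match / (z_match + z_nonmatch)
```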

Page 17:

CRF String Edit Distance FSM

  [FSM diagram: Start state, match sub-model (m = 1) and non-match sub-model (m = 0), each with states copy, insert, delete, subst]

x1 = “Tommi Jaakkola”
x2 = “Tommi Jakola”

Probability summed over all alignments in match states:      0.8
Probability summed over all alignments in non-match states:  0.2

Page 18:

CRF String Edit Distance FSM

  [FSM diagram: Start state, match sub-model (m = 1) and non-match sub-model (m = 0), each with states copy, insert, delete, subst]

x1 = “Tom Dietterich”
x2 = “Tom Dean”

Probability summed over all alignments in match states:      0.1
Probability summed over all alignments in non-match states:  0.9

Page 19:

Parameter Estimation

Expectation Maximization
• E-step: Estimate the distribution over alignments, p(a | x1^(j), x2^(j)), using the current parameters.
• M-step: Change parameters to maximize the complete (penalized) log likelihood, with an iterative quasi-Newton method (BFGS).

Given a training set of string pairs and match/non-match labels, the objective function is the incomplete log likelihood

  O = Σ_j log p(m^(j) | x1^(j), x2^(j))

The complete log likelihood:

  Σ_j Σ_a ( log p(m^(j) | a, x1^(j), x2^(j)) ) p(a | x1^(j), x2^(j))

This is “conditional EM”, but it avoids the complexities of [Jebara 1998], because there is no need to solve the M-step in closed form.
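Schematically, training alternates the two steps above. Here is a hedged Python sketch of the loop, where posterior_fn (the E-step alignment posteriors) and neg_expected_cll (the negative expected complete, penalized log likelihood) are hypothetical callables standing in for the real forward-backward and objective computations:

```python
import numpy as np
from scipy.optimize import minimize

def train_em(pairs, labels, theta0, posterior_fn, neg_expected_cll, num_em_iters=20):
    """pairs: list of (x1, x2); labels: 1 for match, 0 for non-match.
    A structural sketch only: the two supplied callables are assumptions,
    not the authors' implementation."""
    theta = np.asarray(theta0, dtype=float)   # initialize to a reasonable edit distance
    for _ in range(num_em_iters):
        # E-step: distribution over alignments under the current parameters
        posteriors = [posterior_fn(x1, x2, theta) for (x1, x2) in pairs]
        # M-step: quasi-Newton (BFGS) on the expected complete (penalized) log likelihood
        result = minimize(neg_expected_cll, theta,
                          args=(pairs, labels, posteriors), method="BFGS")
        theta = result.x
    return theta
```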

Page 20:

Efficient Training

• Dynamic programming table is 3D; with |x1| = |x2| = 100 and |S| = 12, that is 100 × 100 × 12 = 120,000 entries.

• Use beam search during the E-step [Pal, Sutton, McCallum 2005]

• Unlike completely observed CRFs, objective function is not convex.

• Initialize parameters not at zero, but so as to yield a reasonable initial edit distance.
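One generic way to realize the beam idea (a sketch under my own assumptions, not necessarily the authors' exact criterion): at each stage of the dynamic program, keep only the highest-scoring (cell, state) entries and drop the rest from further expansion.

```python
def prune_to_beam(cell_scores, beam_size=32):
    """cell_scores: dict mapping (i, j, state) -> forward score for the entries
    produced at one stage of the DP.  Keep only the `beam_size` best entries;
    a generic beam-pruning sketch."""
    best = sorted(cell_scores.items(), key=lambda kv: kv[1], reverse=True)[:beam_size]
    return dict(best)
```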

Page 21:

What Alignments are Learned?

  [FSM diagram: Start state, match sub-model (m = 1) and non-match sub-model (m = 0), each with states copy, insert, delete, subst]

x1 = “Tommi Jaakkola”
x2 = “Tommi Jakola”

  [Alignment-matrix figure: x1 = “Tommi Jaakkola” along one axis, x2 = “Tommi Jakola” along the other, showing the learned alignment path]

Page 22:

What Alignments are Learned?

  [FSM diagram: Start state, match sub-model (m = 1) and non-match sub-model (m = 0), each with states copy, insert, delete, subst]

x1 = “Bruce Croft”
x2 = “Tom Dean”

  [Alignment-matrix figure: x1 = “Bruce Croft” along one axis, x2 = “Tom Dean” along the other, showing the learned alignment path]

Page 23:

What Alignments are Learned?

  [FSM diagram: Start state, match sub-model (m = 1) and non-match sub-model (m = 0), each with states copy, insert, delete, subst]

x1 = “Jaime Carbonell”
x2 = “Jamie Callan”

  [Alignment-matrix figure: x1 = “Jaime Carbonell” along one axis, x2 = “Jamie Callan” along the other, showing the learned alignment path]

Page 24:

Summary of Advantages

• Arbitrary features of the input strings
  – Examine past, future context
  – Use lexicons, WordNet

• Extremely flexible edit operations
  – A single operation may make arbitrary jumps in both strings, of size determined by input features

• Discriminative Training
  – Maximize ability to predict match vs. non-match

Page 25:

Experimental Results: Data Sets

• Restaurant name, Restaurant address
  – 864 records, 112 matches
  – E.g. “Abe’s Bar & Grill, E. Main St”
         “Abe’s Grill, East Main Street”

• People names, UIS DB generator
  – Synthetic noise
  – E.g. “John Smith” vs “Snith, John”

• CiteSeer Citations
  – In four sections: Reason, Face, Reinforce, Constraint
  – E.g. “Rusell & Norvig, “Artificial Intelligence: A Modern...”
         “Russell & Norvig, “Artificial Intelligence: An Intro...”

Page 26:

Experimental Results: Features

• same, different
• same-alphabetic, different-alphabetic
• same-numeric, different-numeric
• punctuation1, punctuation2
• alphabet-mismatch, numeric-mismatch
• end-of-1, end-of-2
• same-next-character, different-next-character
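Features like these are simple predicates evaluated on the two strings at the current alignment position; a small illustrative Python sketch (the exact definitions used in the experiments are not given on the slide):

```python
def features(x1, x2, i1, i2):
    """A few of the features above, computed at alignment position (i1, i2).
    These definitions are plausible illustrations, not the paper's exact ones."""
    punct = ",.;:'\"()-&"
    c1 = x1[i1] if i1 < len(x1) else ""
    c2 = x2[i2] if i2 < len(x2) else ""
    return {
        "same": c1 == c2,
        "different": c1 != c2,
        "same-alphabetic": c1 == c2 and c1.isalpha(),
        "different-alphabetic": c1 != c2 and c1.isalpha() and c2.isalpha(),
        "same-numeric": c1 == c2 and c1.isdigit(),
        "different-numeric": c1 != c2 and c1.isdigit() and c2.isdigit(),
        "punctuation1": c1 != "" and c1 in punct,
        "punctuation2": c2 != "" and c2 in punct,
        "end-of-1": i1 >= len(x1) - 1,
        "end-of-2": i2 >= len(x2) - 1,
        "same-next-character": x1[i1 + 1:i1 + 2] == x2[i2 + 1:i2 + 2] != "",
        "different-next-character": x1[i1 + 1:i1 + 2] != x2[i2 + 1:i2 + 2],
    }

print(features("Apex Internat'l", "Apex International", 5, 5))
```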

Page 27:

Experimental Results: Edit Operations

• insert, delete, substitute/copy
• swap-two-characters
• skip-word-if-in-lexicon
• skip-parenthesized-words
• skip-any-word
• substitute-word-pairs-in-translation-lexicon
• skip-word-if-present-in-other-string
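Operations such as skip-word-if-present-in-other-string are transitions that consume a whole word at once rather than a single character. A hedged sketch of how such a jump could be proposed (names are hypothetical and the paper's exact definition may differ):

```python
def skip_word_if_present_in_other_string(x1, x2, i1, i2):
    """If the word starting at x1[i1] also occurs in x2, propose an edit that
    jumps i1 past that whole word (and a trailing space) while leaving i2
    unchanged.  Returns the new i1, or None if the operation does not apply."""
    if i1 >= len(x1) or not x1[i1].isalnum():
        return None
    end = i1
    while end < len(x1) and x1[end].isalnum():
        end += 1
    word = x1[i1:end]
    if word not in x2.split():
        return None
    while end < len(x1) and x1[end] == " ":
        end += 1                     # also consume the trailing space
    return end

# "Cohen William" vs "William Cohen": the first word of x1 can be skipped
# because it appears elsewhere in x2, enabling a word-order jump.
print(skip_word_if_present_in_other_string("Cohen William", "William Cohen", 0, 0))
```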

Page 28:

Experimental Results

F1 (harmonic mean of precision and recall):

  Distance          CiteSeer                              Restaurant   Restaurant
  metric            Reason   Face    Reinf   Constraint   name         address
  Levenshtein       0.927    0.952   0.893   0.924        0.290        0.686
  Learned Leven.    0.938    0.966   0.907   0.941        0.354        0.712
  Vector            0.897    0.922   0.903   0.923        0.365        0.380
  Learned Vector    0.924    0.875   0.808   0.913        0.433        0.532

  [Bilenko & Mooney 2003]

Page 29:

Experimental Results

F1 (harmonic mean of precision and recall):

  Distance            CiteSeer                              Restaurant   Restaurant
  metric              Reason   Face    Reinf   Constraint   name         address
  Levenshtein         0.927    0.952   0.893   0.924        0.290        0.686
  Learned Leven.      0.938    0.966   0.907   0.941        0.354        0.712
  Vector              0.897    0.922   0.903   0.923        0.365        0.380
  Learned Vector      0.924    0.875   0.808   0.913        0.433        0.532
  CRF Edit Distance   0.964    0.918   0.917   0.976        0.448        0.783

  [Bilenko & Mooney 2003]

Page 30:

Experimental Results

  F1
  Without skip-if-present-in-other-string   0.856
  With skip-if-present-in-other-string      0.981

Data set: person names, with word-order noise added

Page 31:

Related Work

• Learned Edit Distance
  – [Bilenko & Mooney 2003], [Cohen et al 2003], ...
  – [Joachims 2003]: Max-margin, trained on alignments

• Conditionally-trained models with latent variables
  – [Jebara 1999]: “Conditional Expectation Maximization”
  – [Quattoni, Collins, Darrell 2005]: CRF for visual object recognition, with latent classes for object sub-patches
  – [Zettlemoyer & Collins 2005]: CRF for mapping sentences to logical form, with latent parses

Page 32:

“Predictive Random Fields”: Latent Variable Models fit by Multi-way Conditional Probability

• For clustering structured data, à la Latent Dirichlet Allocation & its successors

• But an undirected model, like the Harmonium [Welling, Rosen-Zvi, Hinton, 2005]

• But trained by a “multi-conditional” objective:
    O = P(A|B,C) P(B|A,C) P(C|A,B)
  e.g. A, B, C are different modalities

(cf. “Predictive Likelihood”)

[McCallum, Wang, Pal, 2005]

Page 33:

Predictive Random Fields: mixture of Gaussians on synthetic data

  [Figure, four panels: Data (classify by color); Generatively trained; Conditionally-trained [Jebara 1998]; Predictive Random Field]

[McCallum, Wang, Pal, 2005]

Page 34:

Predictive Random Fields vs. Harmonium on a document retrieval task

  [Figure: retrieval results comparing four models: Harmonium, joint with words; Harmonium, joint with class labels and words; Conditionally-trained to predict class labels; Predictive Random Field, multi-way conditionally trained]

[McCallum, Wang, Pal, 2005]

Page 35:

Summary

• String edit distance
  – Widely used in many fields

• As in CRF sequence labeling, benefit by
  – conditional-probability training, and
  – ability to use arbitrary, non-independent input features

• Example of a conditionally-trained model with latent variables
  – “Find the alignments that most help distinguish match from non-match.”
  – May ultimately want the alignments, but only have relatively-easier-to-label +/- labels at training time: “distantly-labeled data”, “semi-supervised learning”

• Future work: Edit distance on trees.

• See also “Predictive Random Fields”: http://www.cs.umass.edu/~pal/PRFTR.pdf

Page 36:

End of talk