
Page 1:

A Conditional Random Field for Discriminatively-trained

Finite-state String Edit Distance

Andrew McCallum

Kedar Bellare

Fernando Pereira

Thanks to Charles Sutton, Xuerui Wang and Mikhail Bilenko for helpful discussions.


Page 2:

String Edit Distance

• Distance between sequences x and y:
  – "cost" of lowest-cost sequence of edit operations that transform string x into y.

Page 3:

String Edit Distance

• Distance between sequences x and y:
  – "cost" of lowest-cost sequence of edit operations that transform string x into y.

• Applications
  – Database Record Deduplication

    Apex International Hotel Grassmarket Street
    Apex Internat’l Grasmarket Street

    Are these records duplicates of the same hotel?

Page 4:

String Edit Distance

• Distance between sequences x and y:
  – "cost" of lowest-cost sequence of edit operations that transform string x into y.

• Applications
  – Database Record Deduplication
  – Biological Sequences

    AGCTCTTACGATAGAGGACTCCAGA
    AGGTCTTACCAAAGAGGACTTCAGA

Page 5:

String Edit Distance

• Distance between sequences x and y:
  – "cost" of lowest-cost sequence of edit operations that transform string x into y.

• Applications
  – Database Record Deduplication
  – Biological Sequences
  – Machine Translation

    Il a acheté une pomme
    He bought an apple

Page 6:

String Edit Distance

• Distance between sequences x and y:
  – "cost" of lowest-cost sequence of edit operations that transform string x into y.

• Applications
  – Database Record Deduplication
  – Biological Sequences
  – Machine Translation
  – Textual Entailment

    He bought a new car last night
    He purchased a brand new automobile yesterday evening

Page 7:

Levenshtein Distance

Edit operations:
  copy    Copy a character from x to y            (cost 0)
  insert  Insert a character into y               (cost 1)
  delete  Delete a character from y               (cost 1)
  subst   Substitute one character for another    (cost 1)

Align two strings:
  x1 = William W. Cohon
  x2 = Willleam Cohen

Lowest-cost alignment (operation and cost at each position):

  x1:    W    i    l    l    -    i    a    m    _    W    .    _    C    o    h    o    n
  x2:    W    i    l    l    l    e    a    m    -    -    -    _    C    o    h    e    n
  op:    copy copy copy copy ins  sub  copy copy del  del  del  copy copy copy copy sub  copy
  cost:  0    0    0    0    1    1    0    0    1    1    1    0    0    0    0    1    0

  ("-" marks a position where that string contributes no character; ins = insert, del = delete, sub = subst)

Total cost = 6 = Levenshtein Distance

[Levenshtein 1966]

Page 8:

Levenshtein Distance

Edit operations:
  copy    Copy a character from x to y            (cost 0)
  insert  Insert a character into y               (cost 1)
  delete  Delete a character from y               (cost 1)
  subst   Substitute one character for another    (cost 1)

Dynamic program:

  D(i,j) = score of the best alignment of x1...xi with y1...yj

  D(i,j) = min {  D(i-1,j-1) + (xi ≠ yj),    (copy / subst)
                  D(i-1,j)   + 1,            (delete)
                  D(i,j-1)   + 1  }          (insert)

DP table for "William" vs. "Willleam"; the bottom-right entry is the total cost = distance:

         W  i  l  l  l  e  a  m
      0  1  2  3  4  5  6  7  8
   W  1  0  1  2  3  4  5  6  7
   i  2  1  0  1  2  3  4  5  6
   l  3  2  1  0  1  2  3  4  5
   l  4  3  2  1  0  1  2  3  4
   i  5  4  3  2  1  1  2  3  4
   a  6  5  4  3  2  2  2  2  4
   m  7  6  5  4  3  3  3  3  2
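The recurrence translates directly into code. Below is a minimal Python sketch of the quadratic dynamic program (not part of the talk; function and variable names are my own):

```python
def levenshtein(x, y):
    """Unit-cost edit distance: copy costs 0; insert, delete, subst cost 1."""
    n, m = len(x), len(y)
    # D[i][j] = cost of the best alignment of x[:i] with y[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i              # i deletions
    for j in range(m + 1):
        D[0][j] = j              # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (x[i - 1] != y[j - 1]),  # copy / subst
                D[i - 1][j] + 1,                           # delete
                D[i][j - 1] + 1,                           # insert
            )
    return D[n][m]

print(levenshtein("William W. Cohon", "Willleam Cohen"))   # 6, matching the alignment above
```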

Page 9:

Levenshtein Distance with Markov Dependencies

Edit operations, with cost conditioned on the previous operation:

                                                  Cost after
                                                  copy  insert  delete  subst
  copy    Copy a character from x to y              0      0       0      0
  insert  Insert a character into y                 1     1/2      1      1
  delete  Delete a character from y                 1      1      1/2     1
  subst   Substitute one character for another      1      1       1      1

  (repeated insert or repeated delete is cheaper)

Learn these costs from training data.

The dynamic programming table becomes a 3D DP table: the same grid as before, with one layer per previous edit operation.

  [FSM diagram: states copy, insert, delete, subst]
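To make "cost conditioned on the previous operation" concrete, here is a hedged Python sketch of the 3D dynamic program. The function names, the cost-table interface, and the handling of the first operation (start_cost) are my own assumptions, not the talk's:

```python
INF = float("inf")
OPS = ("copy", "insert", "delete", "subst")

def markov_edit_distance(x, y, cost, start_cost):
    """cost[prev][op] = cost of operation `op` when the previous operation was
    `prev`; start_cost[op] = cost of `op` as the first operation (an assumption,
    the slide does not specify it).  Returns the cheapest cost of turning x into y."""
    n, m = len(x), len(y)
    # D[i][j][op] = best cost of an alignment of x[:i] with y[:j] whose last
    #               operation is `op` (the third DP dimension)
    D = [[dict.fromkeys(OPS, INF) for _ in range(m + 1)] for _ in range(n + 1)]

    def relax(i, j, op, c):
        if c < D[i][j][op]:
            D[i][j][op] = c

    for i in range(n + 1):
        for j in range(m + 1):
            # cheapest way to reach cell (i, j) and then perform each operation
            if i == 0 and j == 0:
                leave = dict(start_cost)
            else:
                leave = {op: min(D[i][j][p] + cost[p][op] for p in OPS) for op in OPS}
            if i < n and j < m:
                # simplification: copy when the characters agree, otherwise substitute
                step = "copy" if x[i] == y[j] else "subst"
                relax(i + 1, j + 1, step, leave[step])
            if j < m:
                relax(i, j + 1, "insert", leave["insert"])
            if i < n:
                relax(i + 1, j, "delete", leave["delete"])
    return min(D[n][m].values())

# Unit costs, except that repeating an insert or a delete is cheaper,
# following the slide's "repeated delete is cheaper" annotation.
base = {"copy": 0.0, "insert": 1.0, "delete": 1.0, "subst": 1.0}
cost = {p: dict(base) for p in OPS}
cost["insert"]["insert"] = 0.5
cost["delete"]["delete"] = 0.5
print(markov_edit_distance("William W. Cohon", "Willleam Cohen", cost, dict(base)))
# lower than the unit-cost distance of 6, because the run of deletes is now cheaper
```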

Page 10:

Ristad & Yianilos (1997): essentially a Pair-HMM, generating an edit/state/alignment sequence and two strings.

  p(a, x1, x2) = ∏_t p(a_t | a_{t-1}) · p(x1[a_t.i1], x2[a_t.i2] | a_t)      (complete data likelihood)

Learn via EM:
  Expectation step: calculate the likelihood of alignment paths.
  Maximization step: make those paths more likely.

  x1 (string 1):  W i l l i a m _ W . _ C o h o n
  x2 (string 2):  W i l l l e a m _ C o h e n

  alignment a:
    a.e:   copy copy copy copy ins  sub  copy copy del  del  del  copy copy copy copy sub  copy
    a.i1:  1    2    3    4    4    5    6    7    8    9    10   11   12   13   14   15   16
    a.i2:  1    2    3    4    5    6    7    8    8    8    8    9    10   11   12   13   14

    (ins = insert, del = delete, sub = subst)

  p(x1, x2) = Σ_{a : x1,x2} ∏_t p(a_t | a_{t-1}) · p(x1[a_t.i1], x2[a_t.i2] | a_t)

  incomplete data likelihood (sum over all alignments consistent with x1 and x2)

Match score = p(x1, x2)

Given a training set of matching string pairs, the objective function is

  O = ∏_j p(x1^(j), x2^(j))

Page 11:

Ristad & Yianilos Regrets

• Limited features of input strings
  – Examine only a single character pair at a time
  – Difficult to use upcoming string context, lexicons, ...
  – Example: “Senator John Green” vs. “John Green”

• Limited edit operations
  – Difficult to generate arbitrary jumps in both strings
  – Example: “UMass” vs. “University of Massachusetts”

• Trained only on positive match data
  – Doesn’t include information-rich “near misses”
  – Example: “ACM SIGIR” ≠ “ACM SIGCHI”

So, consider a model trained by conditional probability.

Page 12:

Conditional Probability (Sequence) Models

• We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(y|x) instead of P(y,x).

– Can examine features, but not responsible for generating them.

– Don’t have to explicitly model their dependencies.

Page 13:

From HMMs to Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]

Joint (HMM), over a label sequence y = y1, y2, ..., yn and an observation sequence x = x1, x2, ..., xn:

  P(y, x) = ∏_{t=1}^{|x|} P(y_t | y_{t-1}) · P(x_t | y_t)

Conditional:

  P(y | x) = (1 / P(x)) ∏_{t=1}^{|x|} P(y_t | y_{t-1}) · P(x_t | y_t)

           = (1 / Z(x)) ∏_{t=1}^{|x|} Φ_s(y_t, y_{t-1}) · Φ_o(x_t, y_t)

  where  Φ_o(x_t, y_t) = exp( Σ_k λ_k f_k(y_t, x_t) )

(A super-special case of Conditional Random Fields: the linear chain.)

Set parameters by maximum likelihood, using an optimization method on L.

  [Figure: chain-structured graphical models over ... y_{t-1}, y_t, y_{t+1} ... and ... x_{t-1}, x_t, x_{t+1} ..., drawn for both the joint and the conditional model]

Wide-spread interest, positive experimental results in many applications:
  Noun phrase, Named entity [HLT’03], [CoNLL’03]
  Protein structure prediction [ICML’04]
  IE from Bioinformatics text [Bioinformatics ’04], ...
  Asian word segmentation [COLING’04], [ACL’04]
  IE from Research papers [HLT’04]
  Object classification in images [CVPR ’04]
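For concreteness, the observation potential is just an exponentiated weighted sum of feature values; a tiny Python sketch with hypothetical feature and weight names:

```python
import math

def phi_o(lambdas, feature_fns, y_t, x_t):
    """Observation potential of the linear-chain CRF above:
    Phi_o(x_t, y_t) = exp( sum_k lambda_k * f_k(y_t, x_t) )."""
    return math.exp(sum(lam * f(y_t, x_t) for lam, f in zip(lambdas, feature_fns)))

# Hypothetical binary feature: current label is "NAME" and the word is capitalized.
cap_and_name = lambda y_t, x_t: float(y_t == "NAME" and x_t[:1].isupper())
print(phi_o([1.5], [cap_and_name], "NAME", "McCallum"))   # exp(1.5) ≈ 4.48
```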

Page 14:

CRF String Edit Distance

  x1 (string 1):  W i l l i a m _ W . _ C o h o n
  x2 (string 2):  W i l l l e a m _ C o h e n

  alignment a:
    a.e:   copy copy copy copy ins  sub  copy copy del  del  del  copy copy copy copy sub  copy
    a.i1:  1    2    3    4    4    5    6    7    8    9    10   11   12   13   14   15   16
    a.i2:  1    2    3    4    5    6    7    8    8    8    8    9    10   11   12   13   14

Joint complete data likelihood (Ristad & Yianilos):

  p(a, x1, x2) = ∏_t p(a_t | a_{t-1}) · p(x1[a_t.i1], x2[a_t.i2] | a_t)

Conditional complete data likelihood (CRF):

  p(a | x1, x2) = (1 / Z_{x1,x2}) ∏_t Φ(a_t, a_{t-1}, x1, x2)

Want to train from a set of string pairs, each labeled one of {match, non-match}:

  match      “William W. Cohon”   “Willlleam Cohen”
  non-match  “Bruce D’Ambrosio”   “Bruce Croft”
  match      “Tommi Jaakkola”     “Tommi Jakola”
  match      “Stuart Russell”     “Stuart Russel”
  non-match  “Tom Deitterich”     “Tom Dean”

Page 15:

CRF String Edit Distance FSM

  [FSM diagram: a single edit-operation FSM with states copy, insert, delete, subst]

Page 16:

CRF String Edit Distance FSM

  [FSM diagram: a Start state leading into two copies of the edit-operation FSM (copy, insert, delete, subst): a match sub-model (m = 1) and a non-match sub-model (m = 0)]

Conditional incomplete data likelihood:

  p(m | x1, x2) = (1 / Z_{x1,x2}) Σ_{a ∈ S_m} ∏_t Φ(a_t, a_{t-1}, x1, x2)

  (S_m = the set of alignments that pass through the sub-model for label m)
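A hedged Python sketch of how p(m | x1, x2) can be computed: run a forward dynamic program over alignments once for the match sub-model and once for the non-match sub-model, using unnormalized potentials Φ, then normalize the two totals (so Z_{x1,x2} is their sum). The potential interface and names are my assumptions, and a practical version would work in log space:

```python
def submodel_score(x1, x2, phi, states):
    """Sum of prod_t Phi(a_t, a_{t-1}, x1, x2) over all alignments that stay
    inside one sub-model.  phi(prev, s, i, j, x1, x2) is an unnormalized
    potential that may inspect the strings around positions (i, j);
    prev is None for the first operation."""
    n, m = len(x1), len(x2)
    alpha = [[dict.fromkeys(states, 0.0) for _ in range(m + 1)] for _ in range(n + 1)]

    def into(i, j, s):
        if i == 0 and j == 0:
            return phi(None, s, i, j, x1, x2)
        return sum(alpha[i][j][p] * phi(p, s, i, j, x1, x2) for p in states)

    for i in range(n + 1):
        for j in range(m + 1):
            if i < n and j < m:
                s = "copy" if x1[i] == x2[j] else "subst"
                alpha[i + 1][j + 1][s] += into(i, j, s)
            if j < m:
                alpha[i][j + 1]["insert"] += into(i, j, "insert")
            if i < n:
                alpha[i + 1][j]["delete"] += into(i, j, "delete")
    return sum(alpha[n][m].values())

def p_match(x1, x2, phi_match, phi_nonmatch,
            states=("copy", "insert", "delete", "subst")):
    """p(m = 1 | x1, x2): the match sub-model's share of the total score."""
    z_match = submodel_score(x1, x2, phi_match, states)
    z_nonmatch = submodel_score(x1, x2, phi_nonmatch, states)
    return z_match / (z_match + z_nonmatch)
```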

Page 17:

CRF String Edit Distance FSM

  [FSM diagram: Start state, match sub-model (m = 1) and non-match sub-model (m = 0), each with states copy, insert, delete, subst]

x1 = “Tommi Jaakkola”
x2 = “Tommi Jakola”

Probability summed over all alignments in match states:      0.8
Probability summed over all alignments in non-match states:  0.2

Page 18:

CRF String Edit Distance FSM

  [FSM diagram: Start state, match sub-model (m = 1) and non-match sub-model (m = 0), each with states copy, insert, delete, subst]

x1 = “Tom Dietterich”
x2 = “Tom Dean”

Probability summed over all alignments in match states:      0.1
Probability summed over all alignments in non-match states:  0.9

Page 19:

Parameter Estimation

Expectation Maximization
• E-step: Estimate the distribution over alignments, p(a | x1^(j), x2^(j)), using the current parameters.
• M-step: Change parameters to maximize the complete (penalized) log likelihood, with an iterative quasi-Newton method (BFGS).

Given a training set of string pairs and match/non-match labels, the objective function is the incomplete log likelihood

  O = Σ_j log p(m^(j) | x1^(j), x2^(j))

The complete log likelihood:

  Σ_j Σ_a ( log p(m^(j) | a, x1^(j), x2^(j)) ) p(a | x1^(j), x2^(j))

This is “conditional EM”, but it avoids the complexities of [Jebara 1998], because there is no need to solve the M-step in closed form.
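Schematically, training alternates the two steps above. Here is a hedged Python sketch of the loop, where posterior_fn (the E-step alignment posteriors) and neg_expected_cll (the negative expected complete, penalized log likelihood) are hypothetical callables standing in for the real forward-backward and objective computations:

```python
import numpy as np
from scipy.optimize import minimize

def train_em(pairs, labels, theta0, posterior_fn, neg_expected_cll, num_em_iters=20):
    """pairs: list of (x1, x2); labels: 1 for match, 0 for non-match.
    A structural sketch only: the two supplied callables are assumptions,
    not the authors' implementation."""
    theta = np.asarray(theta0, dtype=float)   # initialize to a reasonable edit distance
    for _ in range(num_em_iters):
        # E-step: distribution over alignments under the current parameters
        posteriors = [posterior_fn(x1, x2, theta) for (x1, x2) in pairs]
        # M-step: quasi-Newton (BFGS) on the expected complete (penalized) log likelihood
        result = minimize(neg_expected_cll, theta,
                          args=(pairs, labels, posteriors), method="BFGS")
        theta = result.x
    return theta
```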

Page 20:

Efficient Training

• Dynamic programming table is 3D; with |x1| = |x2| = 100 and |S| = 12, that is 100 × 100 × 12 = 120,000 entries.

• Use beam search during the E-step [Pal, Sutton, McCallum 2005]

• Unlike completely observed CRFs, objective function is not convex.

• Initialize parameters not at zero, but so as to yield a reasonable initial edit distance.
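One generic way to realize the beam idea (a sketch under my own assumptions, not necessarily the authors' exact criterion): at each stage of the dynamic program, keep only the highest-scoring (cell, state) entries and drop the rest from further expansion.

```python
def prune_to_beam(cell_scores, beam_size=32):
    """cell_scores: dict mapping (i, j, state) -> forward score for the entries
    produced at one stage of the DP.  Keep only the `beam_size` best entries;
    a generic beam-pruning sketch."""
    best = sorted(cell_scores.items(), key=lambda kv: kv[1], reverse=True)[:beam_size]
    return dict(best)
```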

Page 21:

What Alignments are Learned?

  [FSM diagram: Start state, match sub-model (m = 1) and non-match sub-model (m = 0), each with states copy, insert, delete, subst]

x1 = “Tommi Jaakkola”
x2 = “Tommi Jakola”

  [Alignment-matrix figure: x1 = “Tommi Jaakkola” along one axis, x2 = “Tommi Jakola” along the other, showing the learned alignment path]

Page 22:

What Alignments are Learned?

  [FSM diagram: Start state, match sub-model (m = 1) and non-match sub-model (m = 0), each with states copy, insert, delete, subst]

x1 = “Bruce Croft”
x2 = “Tom Dean”

  [Alignment-matrix figure: x1 = “Bruce Croft” along one axis, x2 = “Tom Dean” along the other, showing the learned alignment path]

Page 23:

What Alignments are Learned?

  [FSM diagram: Start state, match sub-model (m = 1) and non-match sub-model (m = 0), each with states copy, insert, delete, subst]

x1 = “Jaime Carbonell”
x2 = “Jamie Callan”

  [Alignment-matrix figure: x1 = “Jaime Carbonell” along one axis, x2 = “Jamie Callan” along the other, showing the learned alignment path]

Page 24:

Summary of Advantages

• Arbitrary features of the input strings
  – Examine past, future context
  – Use lexicons, WordNet

• Extremely flexible edit operations
  – A single operation may make arbitrary jumps in both strings, of size determined by input features

• Discriminative Training
  – Maximize ability to predict match vs. non-match

Page 25:

Experimental Results: Data Sets

• Restaurant name, Restaurant address
  – 864 records, 112 matches
  – E.g. “Abe’s Bar & Grill, E. Main St”
         “Abe’s Grill, East Main Street”

• People names, UIS DB generator
  – Synthetic noise
  – E.g. “John Smith” vs “Snith, John”

• CiteSeer Citations
  – In four sections: Reason, Face, Reinforce, Constraint
  – E.g. “Rusell & Norvig, “Artificial Intelligence: A Modern...”
         “Russell & Norvig, “Artificial Intelligence: An Intro...”

Page 26:

Experimental Results: Features

• same, different
• same-alphabetic, different-alphabetic
• same-numeric, different-numeric
• punctuation1, punctuation2
• alphabet-mismatch, numeric-mismatch
• end-of-1, end-of-2
• same-next-character, different-next-character
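Features like these are simple predicates evaluated on the two strings at the current alignment position; a small illustrative Python sketch (the exact definitions used in the experiments are not given on the slide):

```python
def features(x1, x2, i1, i2):
    """A few of the features above, computed at alignment position (i1, i2).
    These definitions are plausible illustrations, not the paper's exact ones."""
    punct = ",.;:'\"()-&"
    c1 = x1[i1] if i1 < len(x1) else ""
    c2 = x2[i2] if i2 < len(x2) else ""
    return {
        "same": c1 == c2,
        "different": c1 != c2,
        "same-alphabetic": c1 == c2 and c1.isalpha(),
        "different-alphabetic": c1 != c2 and c1.isalpha() and c2.isalpha(),
        "same-numeric": c1 == c2 and c1.isdigit(),
        "different-numeric": c1 != c2 and c1.isdigit() and c2.isdigit(),
        "punctuation1": c1 != "" and c1 in punct,
        "punctuation2": c2 != "" and c2 in punct,
        "end-of-1": i1 >= len(x1) - 1,
        "end-of-2": i2 >= len(x2) - 1,
        "same-next-character": x1[i1 + 1:i1 + 2] == x2[i2 + 1:i2 + 2] != "",
        "different-next-character": x1[i1 + 1:i1 + 2] != x2[i2 + 1:i2 + 2],
    }

print(features("Apex Internat'l", "Apex International", 5, 5))
```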

Page 27:

Experimental Results: Edit Operations

• insert, delete, substitute/copy
• swap-two-characters
• skip-word-if-in-lexicon
• skip-parenthesized-words
• skip-any-word
• substitute-word-pairs-in-translation-lexicon
• skip-word-if-present-in-other-string
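Operations such as skip-word-if-present-in-other-string are transitions that consume a whole word at once rather than a single character. A hedged sketch of how such a jump could be proposed (names are hypothetical and the paper's exact definition may differ):

```python
def skip_word_if_present_in_other_string(x1, x2, i1, i2):
    """If the word starting at x1[i1] also occurs in x2, propose an edit that
    jumps i1 past that whole word (and a trailing space) while leaving i2
    unchanged.  Returns the new i1, or None if the operation does not apply."""
    if i1 >= len(x1) or not x1[i1].isalnum():
        return None
    end = i1
    while end < len(x1) and x1[end].isalnum():
        end += 1
    word = x1[i1:end]
    if word not in x2.split():
        return None
    while end < len(x1) and x1[end] == " ":
        end += 1                     # also consume the trailing space
    return end

# "Cohen William" vs "William Cohen": the first word of x1 can be skipped
# because it appears elsewhere in x2, enabling a word-order jump.
print(skip_word_if_present_in_other_string("Cohen William", "William Cohen", 0, 0))
```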

Page 28:

Experimental Results

F1 (harmonic mean of precision and recall):

  Distance          CiteSeer                              Restaurant   Restaurant
  metric            Reason   Face    Reinf   Constraint   name         address
  Levenshtein       0.927    0.952   0.893   0.924        0.290        0.686
  Learned Leven.    0.938    0.966   0.907   0.941        0.354        0.712
  Vector            0.897    0.922   0.903   0.923        0.365        0.380
  Learned Vector    0.924    0.875   0.808   0.913        0.433        0.532

  [Bilenko & Mooney 2003]

Page 29:

Experimental Results

F1 (harmonic mean of precision and recall):

  Distance            CiteSeer                              Restaurant   Restaurant
  metric              Reason   Face    Reinf   Constraint   name         address
  Levenshtein         0.927    0.952   0.893   0.924        0.290        0.686
  Learned Leven.      0.938    0.966   0.907   0.941        0.354        0.712
  Vector              0.897    0.922   0.903   0.923        0.365        0.380
  Learned Vector      0.924    0.875   0.808   0.913        0.433        0.532
  CRF Edit Distance   0.964    0.918   0.917   0.976        0.448        0.783

  [Bilenko & Mooney 2003]

Page 30:

Experimental Results

  F1
  Without skip-if-present-in-other-string   0.856
  With skip-if-present-in-other-string      0.981

Data set: person names, with word-order noise added

Page 31:

Related Work

• Learned Edit Distance
  – [Bilenko & Mooney 2003], [Cohen et al 2003], ...
  – [Joachims 2003]: Max-margin, trained on alignments

• Conditionally-trained models with latent variables
  – [Jebara 1999]: “Conditional Expectation Maximization”
  – [Quattoni, Collins, Darrell 2005]: CRF for visual object recognition, with latent classes for object sub-patches
  – [Zettlemoyer & Collins 2005]: CRF for mapping sentences to logical form, with latent parses

Page 32:

“Predictive Random Fields”: Latent Variable Models fit by Multi-way Conditional Probability

• For clustering structured data, à la Latent Dirichlet Allocation & its successors

• But an undirected model, like the Harmonium [Welling, Rosen-Zvi, Hinton, 2005]

• But trained by a “multi-conditional” objective:
    O = P(A|B,C) P(B|A,C) P(C|A,B)
  e.g. A, B, C are different modalities

(cf. “Predictive Likelihood”)

[McCallum, Wang, Pal, 2005]

Page 33:

Predictive Random Fields: mixture of Gaussians on synthetic data

  [Figure, four panels: Data (classify by color); Generatively trained; Conditionally-trained [Jebara 1998]; Predictive Random Field]

[McCallum, Wang, Pal, 2005]

Page 34:

Predictive Random Fields vs. Harmonium on a document retrieval task

  [Figure: retrieval results comparing four models: Harmonium, joint with words; Harmonium, joint with class labels and words; Conditionally-trained to predict class labels; Predictive Random Field, multi-way conditionally trained]

[McCallum, Wang, Pal, 2005]

Page 35:

Summary

• String edit distance
  – Widely used in many fields

• As in CRF sequence labeling, benefit by
  – conditional-probability training, and
  – ability to use arbitrary, non-independent input features

• Example of a conditionally-trained model with latent variables
  – “Find the alignments that most help distinguish match from non-match.”
  – May ultimately want the alignments, but only have relatively-easier-to-label +/- labels at training time: “distantly-labeled data”, “semi-supervised learning”

• Future work: Edit distance on trees.

• See also “Predictive Random Fields”: http://www.cs.umass.edu/~pal/PRFTR.pdf

Page 36:

End of talk