1 Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment Regina Barzilay and Lillian Lee Cornell University HLT-NAACL 2003

1

Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment

Regina Barzilay and Lillian Lee

Cornell University

HLT-NAACL 2003 (22%, wow!)

2

Objective

• Generate paraphrases automatically by learning from comparable corpora

• Domain-dependent paraphrasing

• News-oriented

• The plane bombed the town. The town was bombed by the plane.

3

Three Stages

1. Pattern Selection (Within Corpus)

• Find recurring patterns such as “X kicked Y”

2. Paraphrase Acquisition (Across Corpora)

• Pair patterns such as “X kicked Y” and “Y was kicked by X”

3. Generation

• Convert “Alice kicked Bob” to “Bob was kicked by Alice”

4

Bonus: A Free Diagram

5

1. Pattern Selection (Within Corpus)

2. Paraphrase Acquisition (Across Corpora)

3. Generation

One step at a time

6

Pattern Selection: Sentence Clustering

• Use complete-link clustering to cluster similar sentences within a corpus

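• Use n-gram overlap as the similarity measure

• Folks who have not taken the professor's IR course, were not paying attention in class, or have already forgotten: please refer to the professor's slides.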

• Replace dates, numbers and proper names in sentences with generic tokens to account for argument variability
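The clustering step above can be sketched as follows. This is only a minimal illustration under stated assumptions: the slide says "n-gram overlap" without naming a formula, so Jaccard overlap of word bigrams is assumed here; the merge threshold of 0.3 is a hypothetical value; and `generalize` only handles numbers, since dates and proper names would need a separate tagging step. All function names are illustrative, not taken from the paper.

```python
import re
from itertools import combinations

def generalize(sentence):
    """Lowercase and replace numbers with a generic NUM token (assumption: numbers only)."""
    return re.sub(r"\d+(?:[.,]\d+)*", "NUM", sentence.lower())

def ngrams(tokens, n=2):
    """Set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(a, b, n=2):
    """Jaccard overlap of word n-grams (an assumed choice of overlap measure)."""
    ga, gb = ngrams(a.split(), n), ngrams(b.split(), n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def complete_link(sentences, threshold=0.3):
    """Naive complete-link clustering: the similarity of two clusters is the
    similarity of their LEAST similar cross-cluster pair; merge the best pair
    of clusters until no pair reaches the (hypothetical) threshold."""
    clusters = [[s] for s in sentences]
    while True:
        best, best_pair = threshold, None
        for i, j in combinations(range(len(clusters)), 2):
            sim = min(ngram_overlap(x, y) for x in clusters[i] for y in clusters[j])
            if sim >= best:
                best, best_pair = sim, (i, j)
        if best_pair is None:
            return clusters
        i, j = best_pair  # i < j, so popping j leaves index i valid
        clusters[i] += clusters.pop(j)
```

Because numbers are collapsed to `NUM` first, "A bomb killed 3 people in Gaza" and "A bomb killed 12 people in Gaza" become identical strings and land in the same cluster.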

7

Sample Cluster

Papers with lots of examples are the best

8

Pattern Selection: Inducing Patterns

• Use multiple sequence alignment (MSA), which is an extension of pairwise sequence alignment

• Pairwise sequence alignment: similar to edit distance!

– Aligning two identical words scores 1; inserting a word scores -0.01; aligning two different words scores -0.5

– Want to find the alignment that has the highest score

• MSA’s scoring function is the sum of all the pairwise alignment scores

• Man proposes, Heaven disposes: MSA is NP-complete! But there is an approximation algorithm
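The pairwise scoring above can be illustrated with a standard global-alignment dynamic program. This is a sketch, not the authors' implementation: it uses the three scores quoted on the slide (1, -0.01, -0.5) and covers only the pairwise case, not the MSA approximation. The function name `align_score` is hypothetical.

```python
def align_score(a, b, match=1.0, mismatch=-0.5, gap=-0.01):
    """Highest global alignment score between two token lists."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one insertion per word.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] aligned with a gap
                score[i][j - 1] + gap,       # b[j-1] aligned with a gap
            )
    return score[-1][-1]

s1 = "the plane bombed the town".split()
s2 = "the plane bombed the village".split()
print(round(align_score(s1, s2), 2))
```

With these particular scores, two insertions (-0.02 in total) cost less than one mismatch (-0.5), so in this sketch differing words end up in separate columns rather than being aligned with each other.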

9

Lattice Example

10

Identify the Variables

• We want to identify the variable parts (e.g. events, people's names, locations, …)

• The non-variable part (backbone node) is defined as a node which is shared by more than 50% of the cluster’s sentences

• We still need to tell argument variability (bad) apart from synonym variability (good)
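A small sketch of the backbone test, assuming the lattice is simplified to aligned rows of equal length in which `None` marks a gap. The real lattice is a graph, so this column view is an assumption made for illustration, and `backbone_columns` is a hypothetical name.

```python
from collections import Counter

def backbone_columns(aligned, min_share=0.5):
    """Return {column index: word} for columns whose most common word
    appears in MORE than min_share of the cluster's sentences."""
    n = len(aligned)
    backbone = {}
    for col, words in enumerate(zip(*aligned)):
        counts = Counter(w for w in words if w is not None)
        if counts:
            word, freq = counts.most_common(1)[0]
            if freq / n > min_share:
                backbone[col] = word
    return backbone

rows = [
    ["a", "bomber",   "blew", "himself", "up", "in", "jerusalem"],
    ["a", "bomber",   "blew", "himself", "up", "in", "haifa"],
    ["a", "militant", "blew", "himself", "up", "in", "gaza"],
]
print(backbone_columns(rows))
```

In this toy cluster the last column has three different words, each in one third of the sentences, so it is not part of the backbone and remains a candidate variable region.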

11

Argument Variability vs. Synonym Variability

• Idea: arguments (e.g. different location names) vary more than synonyms do

• Define a synonymy threshold: 30%

• If none of the parallel nodes has at least 30% of all edges pointing to it, then the parallel nodes are arguments rather than synonyms

• Replace parallel argument nodes with a slot
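The threshold rule above can be written as a one-line check. This sketch assumes that the edges pointing to each parallel node have already been counted into a dictionary; `is_argument_slot` and its parameter names are hypothetical.

```python
def is_argument_slot(edge_counts, synonymy_threshold=0.3):
    """True when no parallel node receives at least synonymy_threshold of all
    incoming edges, i.e. the nodes look like arguments rather than synonyms."""
    total = sum(edge_counts.values())
    if total == 0:
        return False  # nothing to decide on
    return all(count / total < synonymy_threshold for count in edge_counts.values())

# Four different place names, one edge each: 25% apiece, so this is a slot.
print(is_argument_slot({"jerusalem": 1, "haifa": 1, "gaza": 1, "hebron": 1}))
# One wording dominates (75%), so the nodes are kept as synonyms.
print(is_argument_slot({"killed": 3, "murdered": 1}))
```

Note that with only three equally frequent alternatives each node gets about 33% of the edges, which is above the 30% threshold, so they would not be turned into a slot.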

12

Lattice with Slots

13



3. Generation

One step at a time

14

Paraphrase Acquisition: A Successful Match

• Want to pair up lattices in two different corpora (e.g. “X kicked Y” in Corpus A and “Y was kicked by X” in Corpus B)

• Idea: paraphrases have the same slot values

– “Take a pair of lattices from different corpora, look back at the sentence clusters from which the two lattices were derived, and compare the slot values of those cross-corpus sentence pairs that appear in articles written on the same day” – © Barzilay 2003

• For example, we can have “the plane bombed the town” in Corpus A, and “the town was bombed by the plane” in Corpus B both written on the same day.

• The more overlapping slot values, the better
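One way to picture the comparison is the sketch below. It assumes each cluster is a list of `(date, slot_values)` pairs and scores a pair of lattices by the average Jaccard overlap of slot values over same-day sentence pairs. The slides only say that more overlap is better, so the averaging, the Jaccard measure, and the name `slot_overlap` are all assumptions.

```python
def slot_overlap(cluster_a, cluster_b):
    """Average overlap of slot values over cross-corpus sentence pairs
    that come from articles written on the same day."""
    scores = []
    for date_a, slots_a in cluster_a:
        for date_b, slots_b in cluster_b:
            if date_a == date_b and (slots_a or slots_b):
                scores.append(len(slots_a & slots_b) / len(slots_a | slots_b))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical toy data: slot values extracted from each cluster's sentences.
active = [("2002-03-01", {"plane", "town"}), ("2002-03-05", {"army", "camp"})]
passive = [("2002-03-01", {"town", "plane"}), ("2002-03-05", {"camp", "army"})]
unrelated = [("2002-03-01", {"minister", "talks"})]

print(slot_overlap(active, passive))
print(slot_overlap(active, unrelated))
```

The active and passive clusters share all slot values on both days, so they would be paired; the unrelated cluster shares none.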

15



3. Generation

One step at a time

16

Generation: Mission Accomplished

• Given an input sentence, use MSA to find the most similar training sentence

• Use the training sentence’s lattice to generate paraphrases
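A heavily simplified sketch of the generation step follows. It swaps the MSA-based lookup for exact template matching and treats single uppercase letters as one-word slots, so it shows only the slot-binding idea, not the alignment against a lattice. The template format and both function names are hypothetical.

```python
def bind_slots(template, tokens):
    """Match tokens against a template; single uppercase letters are one-word slots.
    Returns a dict of slot bindings, or None when the sentence does not fit."""
    if len(template) != len(tokens):
        return None
    bindings = {}
    for t, w in zip(template, tokens):
        if t.isupper() and len(t) == 1:
            bindings[t] = w
        elif t != w:
            return None
    return bindings

def paraphrase(sentence, paired_templates):
    """Rewrite the sentence with the first paired template whose source side matches."""
    tokens = sentence.split()
    for source, target in paired_templates:
        bindings = bind_slots(source.split(), tokens)
        if bindings is not None:
            return " ".join(bindings.get(t, t) for t in target.split())
    return None  # no matching pattern, so no paraphrase is produced

pairs = [("X kicked Y", "Y was kicked by X")]
print(paraphrase("Alice kicked Bob", pairs))  # Bob was kicked by Alice
```

Sentences that match no learned pattern return `None`, which mirrors the fact that only some input sentences receive a paraphrase.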

17

Evaluation Corpora

• Corpus 1: Agence France-Presse (AFP)

• Corpus 2: Reuters

• Between September 2000 and August 2002

• Focus on violence in Israel and army raids on Palestinian territories

• 9 MB of articles in total

18

Experiment 1/2: Templates

• Evaluate the quality of the templates generated

– Example: Is “X kicked Y” equivalent to “Y was kicked by X”?

• Baseline: DIRT, another paraphrase system (focuses on shorter phrases)

• 4 human judges

• Randomly extract 250 pairs of paraphrases per system

• 100 pairs (50 per system) are evaluated by all 4 judges

• Each judge evaluates a different 100 of the remaining 400 pairs
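The numbers in this design add up as follows: 2 systems × 250 pairs = 500 pairs, of which 100 are shared and 400 are split into 4 disjoint sets of 100, so each judge rates 200 pairs. The sketch below reproduces that split. How the remaining pairs were actually distributed (for example, whether they were balanced per system) is not stated on the slide, so the plain random split and the name `assign_pairs` are assumptions.

```python
import random

def assign_pairs(pairs_by_system, n_judges=4, shared_per_system=50, seed=0):
    """Give every judge the shared pairs plus a disjoint slice of the rest."""
    rng = random.Random(seed)
    shared, rest = [], []
    for pairs in pairs_by_system.values():
        pairs = list(pairs)
        rng.shuffle(pairs)
        shared += pairs[:shared_per_system]
        rest += pairs[shared_per_system:]
    rng.shuffle(rest)
    chunk = len(rest) // n_judges
    return {j: shared + rest[j * chunk:(j + 1) * chunk] for j in range(n_judges)}

# Hypothetical placeholders standing in for the extracted paraphrase pairs.
pools = {
    "MSA": [("MSA", i) for i in range(250)],
    "DIRT": [("DIRT", i) for i in range(250)],
}
workload = assign_pairs(pools)
print({judge: len(items) for judge, items in workload.items()})
```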

19

Result 1/2: Open for Call-ins

20

Result 1/2: Open for Call-ins

• MSA outperforms DIRT by about 38% in all cases

• But DIRT focuses only on short phrases, so the comparison is not entirely fair

• But no one had done sentence-level paraphrasing before, so there was no closer baseline

21

Experiment 2/2: Paraphrases

• Evaluate the actual paraphrases generated

• For testing, choose 20 AFP articles about violence in the Middle East that are not in the training corpus

• Try to paraphrase every sentence in the 20 articles

• Baseline: randomly substitute sentence words with WordNet synonyms

• 2 judges (human labor is too expensive, and we have no five-year, fifty-billion grant)

22

Result 2/2: Open for Call-ins

• 59 out of 484 sentences have paraphrases (12.2%)

Judge    MSA      WordNet Synonym

1        81.4%    69.5%

2        78.0%    66.1%

23

Finally

• Generating sentence-level paraphrases was not addressed previously

• Use comparable corpora instead of parallel corpora

• The lab has already presented 5 Barzilay papers (not including this one)
