
Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment

Regina Barzilay and Lillian Lee

Cornell University

HLT-NAACL 2003 (22%, wow!)


Objective

• Generate paraphrases automatically by learning from comparable corpora

• Domain-dependent paraphrasing

• News-oriented

• Example: “The plane bombed the town.” can be paraphrased as “The town was bombed by the plane.”


Three Stages

1. Pattern Selection (Within Corpus)

• Find reoccurring patterns such as “X kicked Y”

2. Paraphrase Acquisition (Across Corpora)

• Pair patterns such as “X kicked Y” and “Y was kicked by X”

3. Generation

• Convert “Alice kicked Bob” to “Bob was kicked by Alice”


Bonus: a free diagram


1. Pattern Selection (Within Corpus)

2. Paraphrase Acquisition (Across Corpora)

3. Generation

One step at a time


Pattern Selection: Sentence Clustering

• Use complete-link clustering to cluster similar sentences within a corpus

• Use n-gram overlap as similarity measure

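• If you have not taken the professor's IR course, did not pay attention in class, or have already forgotten, please refer to the professor's slides.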

• Replace dates, numbers and proper names in sentences with generic tokens to account for argument variability
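The clustering step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Jaccard-style overlap over word 1-grams to 4-grams, the `threshold=0.3` cutoff, and the function names are all assumptions, and the input sentences are assumed to already have dates, numbers and names replaced by generic tokens such as `NAME` and `DATE`.

```python
from itertools import combinations

def ngrams(tokens, max_n=4):
    """All word n-grams of length 1..max_n (max_n is an assumed setting)."""
    return {tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}

def ngram_overlap(a, b, max_n=4):
    """Assumed similarity: shared n-grams divided by all distinct n-grams."""
    ga, gb = ngrams(a.split(), max_n), ngrams(b.split(), max_n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def complete_link(sentences, threshold=0.3):
    """Greedy agglomerative clustering with the complete-link criterion."""
    clusters = [[s] for s in sentences]
    while len(clusters) > 1:
        best, pair = threshold, None
        for i, j in combinations(range(len(clusters)), 2):
            # Complete link: two clusters are only as similar as
            # their least similar pair of members.
            sim = min(ngram_overlap(x, y)
                      for x in clusters[i] for y in clusters[j])
            if sim >= best:
                best, pair = sim, (i, j)
        if pair is None:  # no pair of clusters reaches the threshold
            break
        i, j = pair
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters

s1 = "NAME bombed the town on DATE"
s2 = "NAME bombed the village on DATE"
s3 = "talks resumed in NAME after NUM days"
print(complete_link([s1, s2, s3]))
```

On these three toy sentences the two bombing sentences end up in one cluster and the third stays alone.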


Sample Cluster

Papers with plenty of examples are the best


Pattern Selection: Inducing Patterns

• Use multiple sequence alignment (MSA), which is an extension of pairwise sequence alignment

• Pairwise sequence alignment: similar to edit distance!

– Aligning two identical words scores 1; inserting a word scores -0.01; aligning two different words scores -0.5

– Want to find the alignment that has the highest score

• MSA’s scoring function is the sum of all the pairwise alignment scores

• Man proposes, heaven disposes: MSA is NP-complete! But there is an approximation algorithm
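The pairwise case can be sketched with the standard dynamic-programming recurrence used for edit distance, here maximizing the score instead of minimizing a cost. The scores (1, -0.01, -0.5) are the ones quoted on this slide; the function name `align` and the traceback details are assumptions, and this is only the two-sentence building block, not the multiple-sequence approximation.

```python
def align(a, b, match=1.0, mismatch=-0.5, gap=-0.01):
    """Best-scoring global alignment of two word lists.

    Returns (score, pairs); a pair holds None where a word was inserted.
    """
    n, m = len(a), len(b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Borders are built cumulatively so the traceback below can rely on
    # exact floating-point equality.
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two words
                              score[i - 1][j] + gap,      # word only in a
                              score[i][j - 1] + gap)      # word only in b
    # Trace back from the bottom-right corner to recover one best alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                pairs.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
            continue
        pairs.append((None, b[j - 1]))
        j -= 1
    pairs.reverse()
    return score[n][m], pairs

total, pairs = align("the plane bombed the town".split(),
                     "the plane attacked the town".split())
print(round(total, 2), pairs)
```

In this toy example the four shared words are aligned and each of the two differing words is treated as an insertion, since two insertions (-0.02) cost less than one mismatch (-0.5).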


Lattice Example


Identify the Variables

• We want to identify the variable parts (e.g. events, person names, locations, …)

• The non-variable part (backbone node) is defined as a node which is shared by more than 50% of the cluster’s sentences

• Two kinds of variability remain to be told apart: argument variability (bad) and synonym variability (good)
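The 50% backbone rule can be illustrated on a simplified stand-in for the lattice. The sketch below assumes the cluster has already been aligned into columns, one entry per sentence with `None` where a sentence has no word; this column representation and the name `backbone_columns` are assumptions made for illustration, not the lattice structure used in the paper.

```python
from collections import Counter

def backbone_columns(columns, n_sentences, threshold=0.5):
    """Return (column index, word) for words shared by more than
    `threshold` of the cluster's sentences."""
    backbone = []
    for idx, column in enumerate(columns):
        counts = Counter(tok for tok in column if tok is not None)
        if not counts:
            continue
        word, freq = counts.most_common(1)[0]
        if freq / n_sentences > threshold:  # strictly more than 50%
            backbone.append((idx, word))
    return backbone

# Toy alignment of four sentences (hypothetical data).
columns = [
    ["the", "the", "the", "the"],
    ["plane", "jet", "helicopter", "plane"],
    ["bombed", "bombed", "bombed", "attacked"],
]
print(backbone_columns(columns, n_sentences=4))
```

Here "the" and "bombed" qualify as backbone words, while "plane" (shared by exactly half of the sentences) does not.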


Argument Variability vs. Synonym Variability

• Idea: arguments (e.g. different location names) vary more than synonyms do

• Define synonymy threshold: 30%

• If none of the parallel nodes has at least 30% of all edges pointing to it, then the parallel nodes are arguments rather than synonyms

• Replace parallel argument nodes with a slot
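The 30% rule amounts to a simple check on how the incoming edges are distributed over a group of parallel nodes. The sketch below assumes each group is summarized as a mapping from node word to the number of edges pointing to it; that representation, the name `is_argument_slot`, and the toy data are assumptions.

```python
def is_argument_slot(edge_counts, synonymy_threshold=0.30):
    """True if no parallel node receives at least `synonymy_threshold`
    of all edges, i.e. the nodes look like arguments, not synonyms."""
    total = sum(edge_counts.values())
    if total == 0:
        return False
    return all(count / total < synonymy_threshold
               for count in edge_counts.values())

# Four different place names, one edge each: 25% per node, so a slot.
places = {"Jerusalem": 1, "Gaza": 1, "Hebron": 1, "Nablus": 1}
# Two verbs dominating the edges: kept as synonyms.
verbs = {"bombed": 3, "attacked": 2}
print(is_argument_slot(places), is_argument_slot(verbs))
```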


Lattice with Slots


1. Pattern Selection (Within Corpus)

2. Paraphrase Acquisition (Across Corpora)

3. Generation

One step at a time


Paraphrase Acquisition: Making a Match

• Want to pair up lattices in two different corpora (e.g. “X kicked Y” in Corpus A and “Y was kicked by X” in Corpus B)

• Idea: paraphrases have the same slot values

• “Take a pair of lattices from different corpora, look back at the sentence clusters from which the two lattices were derived, and compare the slot values of those cross-corpus sentence pairs that appear in articles written on the same day” (Barzilay and Lee, 2003)

• For example, “the plane bombed the town” in Corpus A and “the town was bombed by the plane” in Corpus B may both have been written on the same day.

• The more the slot values overlap, the better the match
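The matching idea can be sketched as below. The slide only says that more overlapping slot values are better; the concrete score used here (set overlap averaged over same-day sentence pairs), the `(date, slot_values)` representation, the function names, and the dates are all assumptions made for illustration.

```python
def slot_overlap(values_a, values_b):
    """Assumed overlap measure: shared slot values over all slot values."""
    a, b = set(values_a), set(values_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def lattice_pair_score(cluster_a, cluster_b):
    """Average slot overlap over cross-corpus sentence pairs from the same day.

    Each cluster is a list of (date, slot_values) entries, one per sentence.
    """
    scores = [slot_overlap(slots_a, slots_b)
              for date_a, slots_a in cluster_a
              for date_b, slots_b in cluster_b
              if date_a == date_b]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical clusters behind "X bombed Y" and "Y was bombed by X".
corpus_a = [("2002-03-01", {"plane", "town"})]
corpus_b = [("2002-03-01", {"town", "plane"}),
            ("2002-03-02", {"army", "camp"})]
print(lattice_pair_score(corpus_a, corpus_b))
```

Only the same-day pair is compared, and since its slot values coincide, the two lattices would be paired under this sketch.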


1. Pattern Selection (Within Corpus)

2. Paraphrase Acquisition (Across Corpora)

3. Generation

One step at a time


Generation: Mission Accomplished

• Given an input sentence, use MSA to find the most similar training sentence

• Use the training sentence’s lattice to generate paraphrases
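Once a matching pattern and its paired pattern are known, generation reduces to binding the slots and filling them into the paired pattern. The sketch below uses flat string templates with single-word slots instead of lattices, and skips the MSA-based matching step entirely; `SLOTS`, `match_template` and `paraphrase` are hypothetical names.

```python
SLOTS = {"X", "Y"}  # hypothetical slot names

def match_template(template, sentence):
    """Return slot bindings if the sentence fits the template, else None.

    Simplification: every slot binds exactly one word.
    """
    t_toks, s_toks = template.split(), sentence.split()
    if len(t_toks) != len(s_toks):
        return None
    bindings = {}
    for t, s in zip(t_toks, s_toks):
        if t in SLOTS:
            bindings[t] = s
        elif t != s:
            return None
    return bindings

def paraphrase(sentence, source_template, target_template):
    """Rewrite the sentence by moving its slot values into the target."""
    bindings = match_template(source_template, sentence)
    if bindings is None:
        return None
    return " ".join(bindings.get(t, t) for t in target_template.split())

print(paraphrase("Alice kicked Bob", "X kicked Y", "Y was kicked by X"))
```

This reproduces the example from the overview slide: “Alice kicked Bob” becomes “Bob was kicked by Alice”.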


Evaluation Corpora

• Corpus 1: Agence France-Presse (AFP)

• Corpus 2: Reuters

• Articles published between September 2000 and August 2002

• Focus on violence in Israel and army raids on Palestinian territories

• 9 MB of articles in total


Experiment 1/2: Templates

• Evaluate the quality of the templates generated

• Example: Is “X kicked Y” equivalent to “Y was kicked by X”?

• Baseline: DIRT, another paraphrase system (focuses on shorter phrases)

• 4 human judges

• Randomly extract 250 pairs of paraphrases per system

• 100 pairs (50 per system) are evaluated by all 4 judges

• Each judge evaluates a different 100 of the remaining 400 pairs


Result 1/2: Open for Discussion


Result 1/2: Open for Discussion

• MSA outperforms DIRT by about 38% in all cases

• But DIRT focuses only on short phrases, so the comparison is not entirely fair

• Then again, no one had done sentence-level paraphrasing before


Experiment 2/2: Paraphrases

• Evaluate the actual paraphrases generated

• For testing, choose 20 AFP articles about violence in the Middle East that are not in the training corpus

• Try to paraphrase every sentence in the 20 articles

• Baseline: randomly substitute words in the sentence with WordNet synonyms

• 2 judges (human labor is expensive, and we have no “five years, fifty billion” grant)


Result 2/2: Open for Discussion

• 59 out of 484 sentences have paraphrases (12.2%)

Judge    MSA      WordNet synonyms
1        81.4%    69.5%
2        78.0%    66.1%


Finally

• Sentence-level paraphrase generation had not been addressed previously

• Uses comparable corpora instead of parallel corpora

• Our lab has already presented 5 other Barzilay papers (not counting this one)