
Extracting Paraphrases from a Parallel Corpus

Regina Barzilay and Kathleen R. McKeown
Computer Science Department

Columbia University
New York, NY 10027, USA

{regina,kathy}@cs.columbia.edu

Abstract

While paraphrasing is critical both for interpretation and generation of natural language, current systems use manual or semi-automatic methods to collect paraphrases. We present an unsupervised learning algorithm for identification of paraphrases from a corpus of multiple English translations of the same source text. Our approach yields phrasal and single word lexical paraphrases as well as syntactic paraphrases.

1 Introduction

Paraphrases are alternative ways to convey the same information. A method for the automatic acquisition of paraphrases has both practical and linguistic interest. From a practical point of view, diversity in expression presents a major challenge for many NLP applications. In multidocument summarization, identification of paraphrasing is required to find repetitive information in the input documents. In generation, paraphrasing is employed to create more varied and fluent text. Most current applications use manually collected paraphrases tailored to a specific application, or utilize existing lexical resources such as WordNet (Miller et al., 1990) to identify paraphrases. However, the process of manually collecting paraphrases is time consuming, and moreover, the collection is not reusable in other applications. Existing resources only include lexical paraphrases; they do not include phrasal or syntactically based paraphrases.

From a linguistic point of view, questions concern the operative definition of paraphrases: what types of lexical relations and syntactic mechanisms can produce paraphrases? Many linguists (Halliday, 1985; de Beaugrande and Dressler, 1981) agree that paraphrases retain "approximate conceptual equivalence", and are not limited only to synonymy relations. But the extent of interchangeability between phrases which form paraphrases is an open question (Dras, 1999). A corpus-based approach can provide insights on this question by revealing paraphrases that people use.

This paper presents a corpus-based method for automatic extraction of paraphrases. We use a large collection of multiple parallel English translations of novels [1]. This corpus provides many instances of paraphrasing, because translations preserve the meaning of the original source, but may use different words to convey the meaning. An example of parallel translations is shown in Figure 1. It contains two pairs of paraphrases: ("burst into tears", "cried") and ("comfort", "console").

Emma burst into tears and he tried to comfort her, saying things to make her smile.
Emma cried, and he tried to console her, adorning his words with puns.

Figure 1: Two English translations of the French sentence from Flaubert's "Madame Bovary"

Our method for paraphrase extraction builds upon methodology developed in Machine Translation (MT). In MT, pairs of translated sentences from a bilingual corpus are aligned, and occurrence patterns of words in the two languages are extracted and matched using correlation measures. However, our parallel corpus is far from the clean parallel corpora used in MT.

[1] Foreign sources are not used in our experiment.

The rendition of a literary text into another language includes not only the translation, but also restructuring of the translation to fit the appropriate literary style. This process introduces differences in the translations which are an intrinsic part of the creative process, and it results in greater differences across translations than are found in typical MT parallel corpora, such as the Canadian Hansards. We will return to this point in Section 3.

Based on the specifics of our corpus, we developed an unsupervised learning algorithm for paraphrase extraction. During the preprocessing stage, the corresponding sentences are aligned. We base our method for paraphrase extraction on the assumption that phrases in aligned sentences which appear in similar contexts are paraphrases. To automatically infer which contexts are good predictors of paraphrases, contexts surrounding identical words in aligned sentences are extracted and filtered according to their predictive power. Then, these contexts are used to extract new paraphrases. In addition to learning lexical paraphrases, the method also learns syntactic paraphrases by generalizing the syntactic patterns of the extracted paraphrases. Extracted paraphrases are then applied to the corpus and used to learn new context rules. This iterative algorithm continues until no new paraphrases are discovered.

A novel feature of our approach is the ability to extract multiple kinds of paraphrases:

Identification of lexical paraphrases. In contrast to earlier work on similarity, our approach allows identification of multi-word paraphrases, in addition to single words, a challenging issue for corpus-based techniques.

Extraction of morpho-syntactic paraphrasing rules. Our approach yields a set of paraphrasing patterns by extrapolating the syntactic and morphological structure of extracted paraphrases. This process relies on morphological information and part-of-speech tagging. Many of the rules identified by the algorithm match those that have been described as productive paraphrases in the linguistic literature.

In the following sections, we provide an overview of existing work on paraphrasing, then describe the data used in this work and detail our paraphrase extraction technique. We present the results of our evaluation, and conclude with a discussion of our results.

2 Related Work on Paraphrasing

Many NLP applications are required to deal with the unlimited variety of human language in expressing the same information. So far, three major approaches to collecting paraphrases have emerged: manual collection, utilization of existing lexical resources, and corpus-based extraction of similar words.

Manual collection of paraphrases is usually used in generation (Iordanskaja et al., 1991; Robin, 1994). Paraphrasing is an inevitable part of any generation task, because a semantic concept can be realized in many different ways. Knowledge of possible concept verbalizations can help to generate a text which best fits existing syntactic and pragmatic constraints. Traditionally, alternative verbalizations are derived from a manual corpus analysis, and are, therefore, application specific.

The second approach, utilization of existing lexical resources such as WordNet, overcomes the scalability problem associated with an application-specific collection of paraphrases. Lexical resources are used in statistical generation, summarization and question-answering. The question here is what types of WordNet relations can be considered paraphrases. In some applications, only synonyms are considered paraphrases (Langkilde and Knight, 1998); in others, looser definitions are used (Barzilay and Elhadad, 1997). These definitions are valid in the context of particular applications; however, in general, the correspondence between paraphrasing and types of lexical relations is not clear. The same question arises with automatically constructed thesauri (Pereira et al., 1993; Lin, 1998). While the extracted pairs are indeed similar, they are not paraphrases. For example, while "dog" and "cat" are recognized as the most similar concepts by the method described in (Lin, 1998), it is hard to imagine a context in which these words would be interchangeable.

The first attempt to derive paraphrasing rules from corpora was undertaken by Jacquemin et al. (1997), who investigated morphological and syntactic variants of technical terms. While these rules achieve high accuracy in identifying term paraphrases, the techniques used have not yet been extended to other types of paraphrasing. Statistical techniques were also used successfully by Lapata (2001) to identify paraphrases of adjective-noun phrases. In contrast, our method is not limited to a particular paraphrase type.

3 The Data

The corpus we use for identification of paraphrases is a collection of multiple English translations from a foreign source text. Specifically, we use literary texts written by foreign authors. Many classical texts have been translated more than once, and these translations are available online. In our experiments we used 5 books, among them Flaubert's Madame Bovary, Andersen's Fairy Tales and Verne's Twenty Thousand Leagues Under the Sea. Some of the translations were created during different time periods and in different countries. In total, our corpus contains 11 translations [2].

At first glance, our corpus seems quite similar to the parallel corpora used by researchers in MT, such as the Canadian Hansards. The major distinction lies in the degree of proximity between the translations. Analyzing multiple translations of literary texts, critics (e.g., Wechsler, 1998) have observed that translations "are never identical", and each translator creates his own interpretations of the text. Clauses such as "adorning his words with puns" and "saying things to make her smile" from the sentences in Figure 1 are examples of distinct translations. Therefore, a complete match between the words of related sentences is impossible. This characteristic of our corpus is similar to problems with noisy and comparable corpora (Veronis, 2000), and it prevents us from using methods developed in the MT community based on clean parallel corpora, such as (Brown et al., 1993).

Another distinction between our corpus and parallel MT corpora is the irregularity of word matchings: in MT, no words in the source language are kept as is in the target language translation.

[2] The part of our corpus that is free of copyright restrictions (9 translations) is available at http://www.cs.columbia.edu/~regina/par.

For example, an English translation of a French source does not contain untranslated French fragments. In contrast, in our corpus the same word is usually used in both translations, and only sometimes are its paraphrases used, which means that word–paraphrase pairs will have lower co-occurrence rates than word–translation pairs in MT. For example, consider occurrences of the word "boy" in two translations of "Madame Bovary" (E. Marx-Aveling's translation and the Etext translation). The first text contains 55 occurrences of "boy", which correspond to 38 occurrences of "boy" and 17 occurrences of its paraphrases ("son", "young fellow" and "youngster"). This rules out using word translation methods based only on word co-occurrence counts.

On the other hand, the big advantage of our corpus comes from the fact that parallel translations share many words, which helps the matching process. We describe below a method of paraphrase extraction, exploiting these features of our corpus.

4 Preprocessing

During the preprocessing stage, we perform sentence alignment. Sentences which are translations of the same source sentence contain a number of identical words, which serve as a strong clue for the matching process. Alignment is performed using dynamic programming (Gale and Church, 1991) with a weight function based on the number of common words in a sentence pair. This simple method achieves good results for our corpus, because on average 42% of the words in corresponding sentences are identical. Alignment produced 44,562 pairs of sentences with 1,798,526 words. To evaluate the accuracy of the alignment process, we analyzed 127 sentence pairs from the algorithm's output; 120 (94.5%) were identified as correct alignments.
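To make the alignment step concrete, the following is a minimal sketch of such an aligner, assuming tokenized sentence lists as input. It is an illustration, not the original implementation: it restricts itself to 1-1 matches plus skips (Gale and Church's scheme also allows 1-2 and 2-1 matches), and the skip penalty is a hypothetical parameter.

def common_words(s1, s2):
    # number of word types shared by two tokenized sentences
    return len(set(s1) & set(s2))

def align(text1, text2, skip_penalty=1):
    # text1, text2: lists of tokenized sentences; returns matched index pairs
    n, m = len(text1), len(text2)
    # score[i][j]: best score aligning the first i sentences of text1
    # with the first j sentences of text2
    score = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            moves = []
            if i > 0 and j > 0:   # pair sentence i-1 with sentence j-1
                moves.append((score[i - 1][j - 1]
                              + common_words(text1[i - 1], text2[j - 1]), "match"))
            if i > 0:             # leave a sentence of text1 unmatched
                moves.append((score[i - 1][j] - skip_penalty, "skip1"))
            if j > 0:             # leave a sentence of text2 unmatched
                moves.append((score[i][j - 1] - skip_penalty, "skip2"))
            score[i][j], back[i][j] = max(moves)
    pairs, i, j = [], n, m        # trace back, keeping matched pairs only
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip1":
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))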

We then use a part-of-speech tagger and chunker (Mikheev, 1997) to identify noun and verb phrases in the sentences. These phrases become the atomic units of the algorithm. We also record for each token its derivational root, using the CELEX (Baayen et al., 1993) database.

5 Method for Paraphrase Extraction

Given the aforementioned differences between translations, our method builds on similarity in the local context, rather than on global alignment. Consider the two sentences in Figure 2.

And finally, dazzlingly white, it shone high above them in the empty ? .

It appeared white and dazzling in the empty ? .

Figure 2: Fragments of aligned sentences

Analyzing the contexts surrounding the "?"-marked blanks in both sentences, one expects that they should have the same meaning, because they have the same premodifier "empty" and relate to the same preposition "in" (in fact, the first "?" stands for "sky", and the second for "heavens"). Generalizing from this example, we hypothesize that if the contexts surrounding two phrases look similar enough, then these two phrases are likely to be paraphrases. The definition of the context depends on how similar the translations are. Once we know which contexts are good paraphrase predictors, we can extract paraphrase patterns from our corpus.

Examples of such contexts are verb-object relations and noun-modifier relations, which were traditionally used in word similarity tasks from non-parallel corpora (Pereira et al., 1993; Hatzivassiloglou and McKeown, 1993). However, in our case, more indirect relations can also be clues for paraphrasing, because we know a priori that the input sentences convey the same information. For example, in the sentences from Figure 3, the verbs "ringing" and "sounding" do not share identical subject nouns, but the modifier of both subjects, "Evening", is identical. Can we conclude that identical modifiers of the subject imply verb similarity? To address this question, we need a way to identify contexts that are good predictors for paraphrasing in a corpus.

People said, "The Evening Noise is sounding, the sun is setting."
"The evening bell is ringing," people used to say.

Figure 3: Fragments of aligned sentences

To find "good" contexts, we can analyze all contexts surrounding identical words in the pairs of aligned sentences, and use these contexts to learn new paraphrases. This provides a basis for a bootstrapping mechanism. Starting with identical words in aligned sentences as a seed, we can incrementally learn the "good" contexts, and in turn use them to learn new paraphrases. Identical words play two roles in this process: first, they are used to learn context rules; second, identical words are used in the application of these rules, because the rules contain information about the equality of words in context.

This method of co-training has previously been applied to a variety of natural language tasks, such as word sense disambiguation (Yarowsky, 1995), lexicon construction for information extraction (Riloff and Jones, 1999), and named entity classification (Collins and Singer, 1999). In our case, the co-training process creates a binary classifier, which predicts whether a given pair of phrases is a paraphrase or not.

Our model is based on the DLCoTrain algorithm proposed by Collins and Singer (1999), which applies a co-training procedure to decision list classifiers for two independent sets of features. In our case, one set of features describes the paraphrase pair itself, and another set of features corresponds to the contexts in which paraphrases occur. These features and their computation are described below.

5.1 Feature Extraction

Our paraphrase features include lexical and syntactic descriptions of the paraphrase pair. The lexical feature set consists of the sequence of tokens for each phrase in the paraphrase pair; the syntactic feature set consists of a sequence of part-of-speech tags in which equal words and words with the same root are marked. For example, the value of the syntactic feature for the pair ("the vast chimney", "the chimney") is ("DT1 JJ NN2", "DT1 NN2"), where indices indicate word equalities. We believe that this feature can be useful for two reasons: first, we expect that some syntactic categories cannot be paraphrased by another syntactic category. For example, a determiner is unlikely to be a paraphrase of a verb. Second, this description is able to capture regularities in phrase-level paraphrasing. In fact, a similar representation was used by Jacquemin et al. (1997) to describe term variations.
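As an illustration, a small sketch of this feature computation follows. It is a simplification under assumed conventions: words sharing only a derivational root are not marked, and the (token, tag) input format is hypothetical (the paper obtains tags from Mikheev's tagger).

def syntactic_feature(phrase1, phrase2):
    # phrase1, phrase2: lists of (token, pos_tag) pairs
    shared = {tok for tok, _ in phrase1} & {tok for tok, _ in phrase2}
    index = {}
    def tag_seq(phrase):
        out = []
        for tok, pos in phrase:
            if tok in shared:                       # mark equal words
                index.setdefault(tok, len(index) + 1)
                out.append(pos + str(index[tok]))
            else:
                out.append(pos)
        return " ".join(out)
    return tag_seq(phrase1), tag_seq(phrase2)

# ("the vast chimney", "the chimney") -> ("DT1 JJ NN2", "DT1 NN2")
print(syntactic_feature([("the", "DT"), ("vast", "JJ"), ("chimney", "NN")],
                        [("the", "DT"), ("chimney", "NN")]))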

The contextual feature is a combination of the left and right syntactic contexts surrounding actual known paraphrases. There are a number of context representations that could be considered as candidates: lexical n-grams, POS n-grams and parse tree fragments. The natural choice is a parse tree; however, existing parsers perform poorly in our domain [3]. Part-of-speech tags provide the required level of abstraction, and can be accurately computed for our data. The left (right) context is a sequence of part-of-speech tags of the n words occurring to the left (right) of the paraphrase. As in the case of the syntactic paraphrase features, tags of identical words are marked. For example, when n = 2, the contextual feature for the paraphrase pair ("comfort", "console") from the sentences in Figure 1 is left1 = "VB1 TO2" ("tried to"), left2 = "VB1 TO2" ("tried to"), right1 = "PRP$3 ,4" ("her,"), right2 = "PRP$3 ,4" ("her,"). In the next section, we describe how the classifiers for contextual and paraphrasing features are co-trained.
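A sketch of this contextual feature extraction follows, under the same assumed input format; shared_index is a hypothetical mapping from words identical across the sentence pair to their equality index.

def context_feature(tagged_sent, start, end, shared_index, n=2):
    # tagged_sent: list of (token, pos); [start:end) spans the candidate phrase
    def mark(window):
        return " ".join(pos + str(shared_index[tok]) if tok in shared_index
                        else pos
                        for tok, pos in window)
    left = tagged_sent[max(0, start - n):start]
    right = tagged_sent[end:end + n]
    return mark(left), mark(right)

# For ("comfort", "console") in Figure 1 with n = 2, both sentences give
# left = "VB1 TO2" ("tried to") and right = "PRP$3 ,4" ("her ,").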

5.2 The Co-training Algorithm

Our co-training algorithm has three stages: initialization, training of the contextual classifier, and training of the paraphrasing classifier.

Initialization. Words which appear in both sentences of an aligned pair are used to create the initial "seed" rules. Using identical words, we create a set of positive paraphrasing examples, such as word1=tried, word2=tried. However, training of the classifier demands negative examples as well; in our case, it requires pairs of words in aligned sentences which are not paraphrases of each other. To find negative examples, we match identical words in the alignment against all different words in the aligned sentence, assuming that identical words can match only each other, and not any other word in the aligned sentences. For example, "tried" from the first sentence in Figure 1 does not correspond to any other word in the second sentence but "tried". Based on this observation, we can derive negative examples such as word1=tried, word2=Emma and word1=tried, word2=console. Given a pair of identical words from two sentences of length n and m, the algorithm produces one positive example and (n − 1) + (m − 1) negative examples.

[3] To the best of our knowledge, all existing statistical parsers are trained on the WSJ or similar corpora. In the experiments we conducted, their performance degraded significantly on our corpus of literary texts.
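The seed construction can be sketched as follows; this is a simplified illustration over raw tokens, whereas the actual system operates over the chunked phrases produced in preprocessing.

def seed_examples(sent1, sent2):
    # sent1, sent2: tokenized aligned sentences
    positives, negatives = [], []
    for w in set(sent1) & set(sent2):     # words appearing in both sentences
        positives.append((w, w))          # one positive example per word
        # pair the word against every *other* word of the opposite sentence:
        # (n - 1) + (m - 1) negative examples
        negatives.extend((w, o) for o in sent2 if o != w)
        negatives.extend((o, w) for o in sent1 if o != w)
    return positives, negatives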

Training of the contextual classifier. Using this initial seed, we record contexts around positive and negative paraphrasing examples. From all the extracted contexts we must identify those which are strong predictors of their category. Following Collins and Singer (1999), filtering is based on the strength of the context and its frequency. The strength of a positive context x is defined as count(x+)/count(x), where count(x+) is the number of times context x surrounds positive examples (paraphrase pairs) and count(x) is the frequency of the context x. The strength of a negative context is defined in a symmetrical manner. For the positive and the negative categories we select the k rules (k = 10 in our experiments) with the highest frequency and strength higher than the predefined threshold of 95%. Examples of selected context rules are shown in Figure 4.
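A compact sketch of this filtering step follows. The Counter-based bookkeeping and the function names are assumptions; the strength formula count(x+)/count(x) is as defined above, relying on the fact that every recorded context surrounds either a positive or a negative example.

from collections import Counter

def select_context_rules(pos_contexts, neg_contexts, k=10, threshold=0.95):
    # pos_contexts / neg_contexts: context features (hashable values)
    # observed around positive / negative examples
    pos, neg = Counter(pos_contexts), Counter(neg_contexts)

    def pick(target, other):
        scored = []
        for ctx, c in target.items():
            total = c + other.get(ctx, 0)     # count(x): total frequency
            if c / total >= threshold:        # strength: count(x+) / count(x)
                scored.append((total, ctx))
        scored.sort(key=lambda t: t[0], reverse=True)  # most frequent first
        return [ctx for _, ctx in scored[:k]]

    return pick(pos, neg), pick(neg, pos)     # positive rules, negative rules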

The parameter of the contextual classifier is the context length. In our experiments we found that a maximal context length of three produces the best results. We also observed that for some rules a shorter context works better. Therefore, when recording contexts around positive and negative examples, we record all contexts with length smaller than or equal to the maximal length.

Because our corpus consists of translations of several books, created by different translators, we expect that the similarity between translations varies from one book to another. This implies that contextual rules should be specific to a particular pair of translations. Therefore, we train the contextual classifier for each pair of translations separately.

left1 = (VB0 TO1)    right1 = (PRP$2 ,)
left2 = (VB0 TO1)    right2 = (PRP$2 ,)

left1 = (WRB0 NN1)   right1 = (NN2 IN)
left2 = (WRB0 NN1)   right2 = (NN2 IN)

left1 = (VB0)        right1 = (JJ1)
left2 = (VB0)        right2 = (JJ1)

left1 = (IN NN0)     right1 = (NN2 IN3)
left2 = (NN0 ,)      right2 = (NN2 IN3)

Figure 4: Examples of context rules extracted by the algorithm.

Training of the paraphrasing classifier. Context rules extracted in the previous stage are then applied to the corpus to derive a new set of positive and negative paraphrasing examples. Application of a rule is performed by searching sentence pairs for subsequences which match the left and right parts of the contextual rule and are less than N tokens apart. For example, applying the first rule from Figure 4 to the sentences from Figure 1 yields the paraphrasing pair ("comfort", "console"). Note that in the original seed set, the left and right contexts were separated by one token. This stretch in rule application allows us to extract multi-word paraphrases.
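The matching procedure can be sketched as follows for a single contextual rule. This is a simplification of the description above: rules are treated as plain POS-tag patterns, the word-equality indices that real rules carry are ignored, and the names and max_gap parameter are assumptions.

def apply_rule(tagged1, tagged2, rule, max_gap=3):
    # tagged1, tagged2: lists of (token, pos) for an aligned sentence pair;
    # rule: dict with POS-tag tuples under keys left1, right1, left2, right2
    def spans(tagged, left, right):
        tags = [pos for _, pos in tagged]
        found = []
        for i in range(len(tags) - len(left) + 1):
            if tuple(tags[i:i + len(left)]) != left:
                continue
            start = i + len(left)
            # allow an interior of 1..max_gap tokens between the contexts,
            # so multi-word candidates can be captured
            for gap in range(1, max_gap + 1):
                end = start + gap
                if tuple(tags[end:end + len(right)]) == right:
                    found.append(" ".join(tok for tok, _ in tagged[start:end]))
        return found
    return [(a, b) for a in spans(tagged1, rule["left1"], rule["right1"])
                   for b in spans(tagged2, rule["left2"], rule["right2"])]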

For each extracted example, paraphrasing rules are recorded and filtered in a manner similar to the contextual rules. Examples of lexical and syntactic paraphrasing rules are shown in Figure 5 and Figure 6. After the extracted lexical and syntactic paraphrases are applied to the corpus, the contextual classifier is retrained. New paraphrases not only add more positive and negative instances to the contextual classifier, but also revise contextual rules for known instances based on the new paraphrase information.

(NN0 POS NN1) ↔ (NN1 IN DT NN0)    King's son ↔ son of the king
(IN NN0) ↔ (VB0)                   in bottles ↔ bottled
(VB0 to VB1) ↔ (VB0 VB1)           start to talk ↔ start talking
(VB0 RB1) ↔ (RB1 VB0)              came suddenly ↔ suddenly came
(VB NN0) ↔ (VB0)                   make appearance ↔ appear

Figure 5: Morpho-syntactic patterns extracted by the algorithm. Lower indices denote token equivalence, upper indices denote root equivalence.

(countless, lots of)       (repulsion, aversion)
(undertone, low voice)     (shrubs, bushes)
(refuse, say no)           (dull tone, gloom)
(sudden appearance, apparition)

Figure 6: Lexical paraphrases extracted by the algorithm.

The iterative process is terminated when no new paraphrases are discovered or the number of iterations exceeds a predefined threshold.

6 The Results

Our algorithm produced 9,483 pairs of lexical paraphrases and 25 morpho-syntactic rules. To evaluate the quality of the produced paraphrases, we picked 500 paraphrasing pairs at random from the lexical paraphrases produced by our algorithm. These pairs were used as test data and also to evaluate whether humans agree on paraphrasing judgments. The judges were given a page of guidelines, defining paraphrase as "approximate conceptual equivalence". The main dilemma in designing the evaluation is whether to include the context: should the human judge see only a paraphrase pair, or should a pair of sentences containing these paraphrases also be given? In a similar MT task, the evaluation of word-to-word translation, context is usually included (Melamed, 2001). Although paraphrasing is considered to be context dependent, there is no agreement on the extent. To evaluate the influence of context on paraphrasing judgments, we performed two experiments: with and without context. First, the human judge is given a paraphrase pair without context; after the judge enters an answer, he is given the same pair with its surrounding context. Each pair was evaluated by two judges (other than the authors). Agreement was measured using the Kappa coefficient (Siegel and Castellan, 1988). Complete agreement between judges would correspond to K = 1; if there is no agreement among judges, then K = 0.
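For reference, agreement can be computed as in the sketch below, which uses the familiar two-rater form of kappa; Siegel and Castellan's formulation estimates chance agreement from pooled marginals and can differ slightly. Degenerate cases (a judge answering all-yes or all-no) are not handled.

def cohens_kappa(judge1, judge2):
    # judge1, judge2: equal-length lists of binary (0/1) judgments
    n = len(judge1)
    observed = sum(a == b for a, b in zip(judge1, judge2)) / n
    p1, p2 = sum(judge1) / n, sum(judge2) / n
    # chance agreement estimated from each judge's marginal "yes" rate
    expected = p1 * p2 + (1 - p1) * (1 - p2)
    return (observed - expected) / (1 - expected)   # 1 = complete agreement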

The judges' agreement on the paraphrasing judgment without context was K = 0.68, which is substantial agreement (Landis and Koch, 1977). The first judge marked 439 (87.8%) pairs as correct paraphrases, and the second judge 426 (85.2%). Judgments with context have even higher agreement (K = 0.97), and the judges identified 459 (91.8%) and 457 (91.4%) pairs as correct paraphrases.

The recall of our method is a more problematic issue. The algorithm can identify paraphrasing relations only between words which occurred in our corpus, which of course does not cover all English tokens. Furthermore, direct comparison with an electronic thesaurus like WordNet is impossible, because it is not known a priori which lexical relations in WordNet can form paraphrases. Thus, we cannot evaluate recall. Instead, we hand-evaluated coverage by asking a human judge to extract paraphrases from 50 sentences, and then counted how many of these paraphrases were predicted by our algorithm. Of the 70 paraphrases extracted by the human judge, 48 (69%) were identified as paraphrases by our algorithm.

In addition to evaluating our system output through precision and recall, we also compared our results with two other methods. The first of these was a machine translation technique for deriving bilingual lexicons (Melamed, 2001), including detection of non-compositional compounds [4]. We performed this evaluation on 60% of the full dataset; this is the portion of the data which is publicly available. Our system produced 6,826 word pairs from this data, and Melamed provided the top 6,826 word pairs resulting from his system on the same data. We randomly extracted 500 pairs from each set of output. Of the 500 pairs produced by our system, 354 (70.8%) were single-word pairs and 146 (29.2%) were multi-word paraphrases, while the majority of the pairs produced by Melamed's system were single-word pairs (90%). We mixed this output and gave the resulting, randomly ordered 1,000 pairs to six evaluators, all of whom were native speakers. Each evaluator provided judgments on 500 pairs without context. Precision for our system was 71.6% and for Melamed's was 52.7%. This increased precision is a clear advantage of our approach, and shows that machine translation techniques cannot be used without modification for this task, particularly for producing multi-word paraphrases. Three caveats should be noted: first, Melamed's system was run without changes for this new task of paraphrase extraction, and it does not use chunk segmentation; second, he ran the system for three days of computation, and the result might improve with more running time, since the system makes incremental improvements on subsequent rounds; finally, the agreement between human judges was lower than in our previous experiments. We are currently exploring whether the information produced by the two different systems can be combined to improve the performance of either system alone.

Another view on the extracted paraphrases can be derived by comparing them with the WordNet thesaurus. This comparison provides us with quantitative evidence on the types of lexical relations people use to create paraphrases.

[4] Equivalences that were identical on both sides were removed from the output.

We selected 112 paraphrasing pairs which occurred at least 20 times in our corpus and such that the words comprising each pair appear in WordNet. The 20-occurrence cutoff was chosen to ensure that the identified pairs are general enough and not idiosyncratic; the frequency threshold selects paraphrases which are not tailored to one context. Examples of paraphrases and their WordNet relations are shown in Figure 7. Only 40 (35%) of the paraphrases are synonyms, 36 (32%) are hyperonyms, 20 (18%) are siblings in the hyperonym tree, 11 (10%) are unrelated, and the remaining 5% are covered by other relations. These figures quantitatively validate our intuition that synonymy is not the only source of paraphrasing. One of the practical implications is that using synonymy relations exclusively to recognize paraphrasing limits system performance.
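A comparison of this kind is easy to reproduce with any WordNet interface. The sketch below uses NLTK (an assumption; the paper does not specify an interface) and checks, in order, shared synsets, direct hypernymy, and shared direct hypernyms (siblings); the first relation found wins, so it approximates rather than replicates the classification above.

# requires: pip install nltk, then nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def wordnet_relation(word1, word2):
    syns1, syns2 = wn.synsets(word1), wn.synsets(word2)
    if any(s1 == s2 for s1 in syns1 for s2 in syns2):
        return "synonyms"                 # some sense is shared outright
    for s1 in syns1:
        for s2 in syns2:
            if s2 in s1.hypernyms() or s1 in s2.hypernyms():
                return "hyperonyms"       # direct hypernym link only
            if set(s1.hypernyms()) & set(s2.hypernyms()):
                return "siblings"         # children of a common parent
    return "unrelated"

print(wordnet_relation("city", "town"))   # the paper lists this pair as siblings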

Synonyms: (rise, stand up), (hot, warm)
Hyperonyms: (landlady, hostess), (reply, say)
Siblings: (city, town), (pine, fir)
Unrelated: (sick, tired), (next, then)

Figure 7: Examples of extracted paraphrases and their WordNet relations.

7 Conclusions and Future Work

In this paper, we presented a method for corpus-based identification of paraphrases from multiple English translations of the same source text. We showed that a co-training algorithm based on contextual and lexico-syntactic features of paraphrases achieves high performance on our data. The wide range of paraphrases extracted by our algorithm sheds light on the paraphrasing phenomenon, which has not been studied from an empirical perspective.

Future work will extend this approach to extract paraphrases from comparable corpora, such as multiple reports from different news agencies about the same event, or different descriptions of a disease from the medical literature. This extension will require a more selective alignment technique, similar to that of Hatzivassiloglou et al. (1999). We will also investigate a more powerful representation of contextual features. Fortunately, statistical parsers produce reliable results on news texts, and can therefore be used to improve the context representation. This will allow us to extract macro-syntactic paraphrases in addition to the local paraphrases which are currently produced by the algorithm.

Acknowledgments

This work was partially supported by a Louis Morin scholarship and by DARPA grant N66001-00-1-8919 under the TIDES program. We are grateful to Dan Melamed for providing us with the output of his program. We thank Noemie Elhadad, Mike Collins, Michael Elhadad and Maria Lapata for useful discussions.

References

R. H. Baayen, R. Piepenbrock, and H. van Rijn, editors. 1993. The CELEX Lexical Database (CD-ROM). Linguistic Data Consortium, University of Pennsylvania.

R. Barzilay and M. Elhadad. 1997. Using lexical chains for text summarization. In Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization, pages 10–17, Madrid, Spain, August.

P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

M. Collins and Y. Singer. 1999. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.

R. de Beaugrande and W. V. Dressler. 1981. Introduction to Text Linguistics. Longman, New York, NY.

M. Dras. 1999. Tree Adjoining Grammar and the Reluctant Paraphrasing of Text. Ph.D. thesis, Macquarie University, Australia.

W. Gale and K. W. Church. 1991. A program for aligning sentences in bilingual corpora. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pages 1–8.

M. Halliday. 1985. An Introduction to Functional Grammar. Edward Arnold, UK.

V. Hatzivassiloglou and K. R. McKeown. 1993. Towards the automatic identification of adjectival scales: Clustering adjectives according to their meaning. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 172–182.

V. Hatzivassiloglou, J. Klavans, and E. Eskin. 1999. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.

L. Iordanskaja, R. Kittredge, and A. Polguere. 1991. Natural Language Generation in Artificial Intelligence and Computational Linguistics, chapter 11. Kluwer Academic Publishers.

C. Jacquemin, J. Klavans, and E. Tzoukermann. 1997. Expansion of multi-word terms for indexing and retrieval using morphology and syntax. In Proceedings of the 35th Annual Meeting of the ACL, pages 24–31, Madrid, Spain, July.

J. R. Landis and G. G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33:159–174.

I. Langkilde and K. Knight. 1998. Generation that exploits corpus-based statistical knowledge. In Proceedings of COLING-ACL.

M. Lapata. 2001. A corpus-based account of regular polysemy: The case of context-sensitive adjectives. In Proceedings of the 2nd Meeting of the NAACL, Pittsburgh, PA.

D. Lin. 1998. Automatic retrieval and clustering of similar words. In Proceedings of COLING-ACL, pages 768–774.

I. D. Melamed. 2001. Empirical Methods for Exploiting Parallel Texts. MIT Press.

A. Mikheev. 1997. The LTG part-of-speech tagger. University of Edinburgh.

G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. 1990. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography (special issue), 3(4):235–245.

F. Pereira, N. Tishby, and L. Lee. 1993. Distributional clustering of English words. In Proceedings of the 31st Annual Meeting of the ACL, pages 183–190.

E. Riloff and R. Jones. 1999. Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, pages 1044–1049. AAAI Press/MIT Press.

J. Robin. 1994. Revision-Based Generation of Natural Language Summaries Providing Historical Background: Corpus-Based Analysis, Design, Implementation, and Evaluation. Ph.D. thesis, Department of Computer Science, Columbia University, NY.

S. Siegel and N. J. Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill.

J. Veronis, editor. 2000. Parallel Text Processing: Alignment and Use of Translation Corpora. Kluwer Academic Publishers.

R. Wechsler. 1998. Performing Without a Stage: The Art of Literary Translation. Catbird Press.

D. Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196.