


Expert Systems with Applications 41 (2014) 2832–2841

Contents lists available at ScienceDirect

Expert Systems with Applications

journal homepage: www.elsevier.com/locate/eswa

Exploiting discourse information to identify paraphrases

0957-4174/$ - see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.eswa.2013.10.018

This paper is an improved and extended version of the paper "EDU-Based Similarity for Paraphrase Identification," presented at the International Conference on Applications of Natural Language to Information Systems (NLDB), 2013, UK.
* Corresponding author. Tel.: +81 80 4254 0684.

E-mail address: [email protected] (N.X. Bach).

Ngo Xuan Bach *, Nguyen Le Minh, Akira Shimazu
School of Information Science, Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa 923-1292, Japan

Keywords: Paraphrase identification; Text similarity; Discourse segmentation; Elementary discourse unit; MT metric; Support vector machine

Abstract

Previous work on paraphrase identification using sentence similarities has not exploited discourse structures, which have been shown to be important information for paraphrase computation. In this paper, we propose a new method, named EDU-based similarity, to compute the similarity between two sentences based on elementary discourse units. Unlike conventional methods, which compute similarities directly on sentences, our method divides sentences into discourse units and employs them to compute similarities. We also show the relation between paraphrases and discourse units, which play an important role in paraphrasing. We apply our method to the paraphrase identification task. Experimental results on the PAN corpus, a large corpus for detecting paraphrases, show the effectiveness of using discourse information for identifying paraphrases. We achieve 93.1% and 93.4% accuracy by using a single SVM classifier and by using a maximal voting model, respectively.


1. Introduction

Paraphrase identification is the task of determining whether two sentences have essentially the same meaning. This task has been shown to play an important role in many natural language applications, including text summarization (Barzilay, McKeown, & Elhadad, 1999), question answering (Duboue & Chu-Carroll, 2006), machine translation (Callison-Burch, Koehn, & Osborne, 2006), natural language generation (Ganitkevitch, Callison-Burch, Napoles, & Van Durme, 2011), and plagiarism detection (Uzuner, Katz, & Nahnsen, 2005). For example, detecting paraphrase sentences would help a text summarization system avoid adding redundant information.

Paraphrase identification is not an easy task. Consider the following two sentence pairs: the first pair is a paraphrase although the two sentences share only a few words, while the second is not a paraphrase even though the two sentences contain almost all the same words.

• "That would indeed be a great blessing." and "The Lord had indeed fulfilled his hopes, and answered his prayers."
• "Peter usually goes to the cinema with his girlfriend." and "Peter never goes to the cinema with his girlfriend."

Although the paraphrase identification task is defined in terms of semantics, it is usually modeled as a binary classification problem, which can be solved by training a statistical classifier. Many methods have been proposed for identifying paraphrases. These methods usually employ the similarity between two sentences as features, computed based on words (Fernando & Stevenson, 2008; Kozareva & Montoyo, 2006; Mihalcea, Corley, & Strapparava, 2006), n-grams (Das & Smith, 2009; Kozareva & Montoyo, 2006), syntactic parse trees (Das & Smith, 2009; Rus, McCarthy, Lintean, McNamara, & Graesser, 2008; Socher, Huang, Pennington, Ng, & Manning, 2011), WordNet (Kozareva & Montoyo, 2006; Mihalcea et al., 2006), and MT metrics, the automated metrics for evaluating translation quality (Madnani, Tetreault, & Chodorow, 2012).

Recently, several studies have shown that discourse structures deliver important information for paraphrase computation. For example, to extract paraphrases, Deléger and Zweigenbaum (2009) match similar paragraphs in comparable texts. Regneri and Wang (2012) extend the distributional hypothesis that entities are similar if they share similar contexts to the discourse level. According to them, sentences that play the same role in a certain discourse and have a similar discourse context can be paraphrases, even if a semantic similarity model does not consider them very similar. Using this assumption, Regneri and Wang (2012) introduce a method for collecting paraphrases based on the sequential event order in discourse. However, both Deléger and Zweigenbaum (2009) and Regneri and Wang (2012) only consider some special kinds of data, where the discourse structures can be easily extracted.

Complete discourse structures such as discourse trees in the RST Discourse Treebank (RST-DT) (Carlson, Marcu, & Okurowski, 2002) are difficult to extract, though they can be very useful for paraphrase computation (Regneri & Wang, 2012). In order to produce such complete discourse structures for a text, we first segment the text into several elementary discourse units (EDUs). Each EDU may be a simple sentence or a clause in a complex sentence. Consecutive EDUs are then put in relation with each other to create a discourse tree (Mann & Thompson, 1988). An example of a discourse tree with three EDUs is shown in Fig. 1. Existing fully automatic discourse parsing systems are neither robust nor very precise (Bach, Minh, & Shimazu, 2012b; Joty, Carenini, Ng, & Mehdad, 2013; Regneri & Wang, 2012). In recent years, however, several discourse segmenters with high performance have been introduced (Bach, Minh, & Shimazu, 2012a; Hernault, Bollegala, & Ishizuka, 2010; Joty, Carenini, & Ng, 2012). The discourse segmenter described in Bach et al. (2012a) gives 91.0% in the F1 score on the RST-DT corpus when using Stanford parse trees (Klein & Manning, 2003).

In this paper, we present a new method to compute the similarity between two sentences based on elementary discourse units (EDU-based similarity). We first segment two sentences into several EDUs using a discourse segmenter, which is trained on the RST-DT corpus. These EDUs are then employed for computing the similarity between the two sentences. The key idea is that for each EDU in one sentence, we try to find the most similar EDU in the other sentence and compute the similarity between them. We show how our method can be applied to the paraphrase identification task. Experimental results on the PAN corpus (Madnani et al., 2012) show that our method is effective for this task. Ours is the first work that employs discourse units for computing similarity as well as for identifying paraphrases.

The rest of this paper is organized as follows. We first present related work in Section 2. Section 3 describes the relation between paraphrases and discourse units. Section 4 presents our method, EDU-based similarity. In Section 5, we introduce our discourse segmentation system, which is used to segment sentences into elementary discourse units. Experiments on the paraphrase identification task are described in Section 6. Section 7 presents some types of errors that our method made during experiments. Finally, Section 8 concludes the paper and discusses future work.

2. Related work

There have been many studies on the paraphrase identification task. Finch, Hwang, and Sumita (2005) use several MT metrics, including BLEU (Papineni, Roukos, Ward, & Zhu, 2002), NIST (Doddington, 2002), WER (Niessen, Och, Leusch, & Ney, 2000), and PER (Leusch, Ueffing, & Ney, 2003), as features for an SVM classifier. Wan, Dras, Dale, and Paris (2006) combine BLEU features with others extracted from dependency relations and tree edit-distance, and also take SVMs as the learning method to train a binary classifier. Mihalcea et al. (2006) use pointwise mutual information, latent semantic analysis, and WordNet to compute an arbitrary text-to-text similarity metric. Kozareva and Montoyo (2006) employ features based on the longest common subsequence (LCS), skip n-grams, and WordNet. They use a meta-classifier composed of SVMs, k-nearest neighbor, and maximum entropy models.

Fig. 1. An example of a discourse tree in RST-DT.

Rus et al. (2008) adapt a graph-based approach (originally developed for recognizing textual entailment) for paraphrase identification. Fernando and Stevenson (2008) build a matrix of word similarities between all pairs of words in both sentences. Das and Smith (2009) introduce a probabilistic model that incorporates both syntax and lexical semantics using quasi-synchronous dependency grammars for identifying paraphrases. Socher et al. (2011) describe a joint model that uses features extracted from both single words and phrases in the parse trees of the two sentences.

Most recently, Madnani et al. (2012) present an investigation of the impact of MT metrics on the paraphrase identification task. They examine eight different MT metrics, including BLEU (Papineni et al., 2002), NIST (Doddington, 2002), TER (Snover, Dorr, Schwartz, Micciulla, & Makhoul, 2006), TERP (Snover, Madnani, Dorr, & Schwartz, 2009), METEOR (Denkowski & Lavie, 2010), SEPIA (Habash & Kholy, 2008), BADGER (Parker, 2008), and MAXSIM (Chan & Ng, 2008), and show that a system using nothing but some MT metrics can achieve state-of-the-art results on this task.

Compared with previous work, ours makes the following contributions to the advancement of paraphrase identification:

• We present the first work on relations between discourse units and paraphrasing, in which discourse units play an important role. We also show that many paraphrase sentences can be generated from the original sentences by conducting several transformations on discourse units.
• We present EDU-based similarity, a new method for computing the similarity between two sentences based on elementary discourse units. As shown in the experimental section, our method is effective for identifying paraphrases.

Discourse structures have only marginally been considered for paraphrase computation. Regneri and Wang (2012) introduce a method for collecting paraphrases using discourse information on a special type of data, TV show episodes. With such data, they assume that discourse structures can be obtained by taking the sentence sequences of recaps. Our work employs the recent advances in discourse segmentation. Hernault et al. (2010) introduce a sequence model for segmenting texts into discourse units using Conditional Random Fields. Joty et al. (2012) present a discourse segmentation system exploiting a Logistic Regression model with l2 regularization; they learn the model parameters using the L-BFGS algorithm (Zhu, Byrd, Lu, & Nocedal, 1997). Bach et al. (2012a) introduce a reranking model for discourse segmentation using subtree features. These segmenters achieve relatively high results on the RST-DT corpus (89.0% in Hernault et al. (2010), 90.1% in Joty et al. (2012), and 91.0% in Bach et al. (2012a)).

Unlike previous studies that exploit the discourse structure of specific data for paraphrase computation, our work introduces a general method that is not limited to any specific kind of text. To our knowledge, this is the first work that exploits discourse units for computing text similarity as well as for identifying paraphrases.

3. Paraphrases and discourse units

In this section, we describe the relation between paraphrases and discourse units. We will show that discourse units are blocks that play an important role in paraphrasing.

Fig. 2 shows an example of a paraphrase sentence pair. In this example, the first sentence can be divided into three elementary discourse units (EDUs), 1A, 1B, and 1C, and the second sentence can also be segmented into three EDUs, 2A, 2B, and 2C. Comparing these six EDUs, we can see that they make three aligned pairs of paraphrases: 1A with 2A, 1B with 2B, and 1C with 2C. Therefore, if we consider the first sentence to be the original sentence, the second sentence can be created by paraphrasing each discourse unit in the original sentence.

Fig. 2. A paraphrase sentence pair in the PAN corpus (Madnani et al., 2012).

Fig. 3. Another paraphrase sentence pair in the PAN corpus.

Fig. 3 shows a more complex case. The first sentence consists of four EDUs, 3A, 3B, 3C, and 3D; and the second sentence includes four EDUs, 4A, 4B, 4C, and 4D. In this case, if we consider the first sentence to be the original one, we can make the following remarks:

• The discourse unit 4A is a paraphrase of the discourse unit 3B,
• The unit 4B is a paraphrase of the combination of two units, 3A and 3C, and
• The combination of two units 4C and 4D is a paraphrase of the unit 3D.

By analyzing paraphrase sentences, we found that discourse units are very important to paraphrasing. In many cases, a paraphrase sentence can be created by applying the following operations to the original sentence:

1. Reordering two discourse units,
2. Combining two discourse units into one unit,
3. Dividing one discourse unit into two units, and
4. Paraphrasing a discourse unit.

An example of Operation 1 and Operation 2 is the case of units 3A, 3B, and 3C in Fig. 3 (reordering 3A and 3B, and then combining 3A and 3C). Unit 3D illustrates an example of Operation 3. The last operation is the most important one, as it is applied to almost all discourse units.

4. EDU-based similarity

Motivated by the analysis of the relation between paraphrases and discourse units, we propose a method to compute the similarity between two sentences. In this section, we assume that each sentence can be represented as a sequence of elementary discourse units (EDUs). The method for segmenting sentences into EDUs will be presented in Section 5.

First, we present the notion of ordered similarity functions. Given two arbitrary texts t1 and t2, an ordered similarity function Sim_ordered(t1, t2) returns a real score that measures how similar t1 is to t2. Note that in this function the roles of t1 and t2 are different: t2 can be seen as a gold standard against which we evaluate t1. Examples of ordered similarity functions are MT metrics, which evaluate how similar a hypothesis text (t1) is to a reference text (t2).

Given an ordered similarity function Sim_ordered, we can define the similarity between two arbitrary texts t1 and t2 as follows:

Sim(t1, t2) = [Sim_ordered(t1, t2) + Sim_ordered(t2, t1)] / 2.   (1)

Let (s1, s2) be a sentence pair; then s1 and s2 can be represented as sequences of elementary discourse units: s1 = (e1, e2, ..., em) and s2 = (f1, f2, ..., fn), where m and n are the numbers of discourse units in s1 and s2, respectively. We define an ordered similarity function between s1 and s2 as follows:

Sim_ordered(s1, s2) = Σ_{i=1}^{m} Imp(ei, s1) · Sim_ordered(ei, s2),   (2)


where Imp(ei, s1) is the importance of the discourse unit ei in the sentence s1, and Sim_ordered(ei, s2) is the ordered similarity between the discourse unit ei and the sentence s2.

In this work, we simply assume that all words contribute equally to the meaning of the sentence. Therefore, the importance function can be computed as follows:

Imp(ei, s1) = |ei| / |s1|,   (3)

where |ei| and |s1| are the lengths (in words) of the discourse unit ei and the sentence s1, respectively. The ordered similarity Sim_ordered(ei, s2) is computed based on the discourse unit fj in the sentence s2 that is most similar to ei:

Sim_ordered(ei, s2) = Max_{j=1..n} Sim_ordered(ei, fj).   (4)

Substituting (3) and (4) into (2), we have:

Sim_ordered(s1, s2) = Σ_{i=1}^{m} (|ei| / |s1|) · Max_{j=1..n} Sim_ordered(ei, fj).   (5)

Finally, from (5) and (1) we have the formula for computing the similarity between two sentences based on their discourse units (EDU-based similarity), where the ordered similarity between two units is computed directly using the definition of the ordered similarity function:

Sim(s1, s2) = [Sim_ordered(s1, s2) + Sim_ordered(s2, s1)] / 2
            = (1/2) · Σ_{i=1}^{m} (|ei| / |s1|) · Max_{j=1..n} Sim_ordered(ei, fj)
            + (1/2) · Σ_{j=1}^{n} (|fj| / |s2|) · Max_{i=1..m} Sim_ordered(fj, ei).   (6)

We now present an example of computing the EDU-based similarity between the two sentences in Fig. 2 using the BLEU score. Table 1 shows the basic information of the calculation step by step.

Table 1. An example of computing sentence-based and EDU-based similarities. Bold values are used to emphasize important points.

Line  Computation
1     s1: Or his needful holiday has come, and he is staying at a friend's house, or is thrown into new intercourse at some health-resort  (Length = 27)
2     s2: Or need a holiday has come, and he stayed in the house of a friend, or disposed of in a new relationship to a health resort  (Length = 29)

Sentence-based similarity
3     BLEU(s1, s2) = 0.5333
4     BLEU(s2, s1) = 0.5330
5     Sim(s1, s2) = [BLEU(s1, s2) + BLEU(s2, s1)] / 2 = 0.5332

Discourse units
6     e1: Or his needful holiday has come  (Length = 7)
7     e2: and he is staying at a friend's house  (Length = 10)
8     e3: or is thrown into new intercourse at some health-resort  (Length = 10)
9     f1: Or need a holiday has come  (Length = 7)
10    f2: and he stayed in the house of a friend  (Length = 10)
11    f3: or disposed of in a new relationship to a health resort  (Length = 12)

EDU-based similarity
12    BLEU(e1, f1) = 0.7143   BLEU(e1, f2) = 0.0931   BLEU(e1, f3) = 0.0699
13    BLEU(e2, f1) = 0.1818   BLEU(e2, f2) = 0.5455   BLEU(e2, f3) = 0.0830
14    BLEU(e3, f1) = 0.0833   BLEU(e3, f2) = 0        BLEU(e3, f3) = 0.4167
15    EDU_BLEU(s1, s2) = (7/27)·0.7143 + (10/27)·0.5455 + (10/27)·0.4167 = 0.5416
16    BLEU(f1, e1) = 0.7143   BLEU(f1, e2) = 0.1613   BLEU(f1, e3) = 0.0699
17    BLEU(f2, e1) = 0.1000   BLEU(f2, e2) = 0.5429   BLEU(f2, e3) = 0
18    BLEU(f3, e1) = 0.0833   BLEU(f3, e2) = 0.0833   BLEU(f3, e3) = 0.4167
19    EDU_BLEU(s2, s1) = (7/29)·0.7143 + (10/29)·0.5429 + (12/29)·0.4167 = 0.5321
20    EDU_Sim(s1, s2) = [EDU_BLEU(s1, s2) + EDU_BLEU(s2, s1)] / 2 = 0.5369

Line 1 and line 2 present the two tokenized sentences and their lengths in words. Lines 3 through 5 compute the similarity directly on the sentences; by this method, the similarity is 0.5332. The elementary discourse units of the two sentences are shown in lines 6 through 11. The computation of EDU-based similarity is described in lines 12 through 20; by this method, the similarity is 0.5369, which is slightly higher than the similarity computed directly on the sentences.
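The aggregation steps in lines 15, 19, and 20 of Table 1 follow directly from Eqs. (5) and (6), and can be re-derived from the per-EDU BLEU scores and lengths reported in the table:

```python
# Length-weighted combination of each EDU's best BLEU match (Eq. (5)),
# using the values reported in Table 1.
def weighted_combination(pairs, sentence_len):
    return sum(length / sentence_len * score for length, score in pairs)

edu_bleu_12 = weighted_combination([(7, 0.7143), (10, 0.5455), (10, 0.4167)], 27)
edu_bleu_21 = weighted_combination([(7, 0.7143), (10, 0.5429), (12, 0.4167)], 29)
print(round(edu_bleu_12, 4))  # 0.5416, as in line 15
print(round(edu_bleu_21, 4))  # 0.5321, as in line 19
print((edu_bleu_12 + edu_bleu_21) / 2)  # ≈ 0.5368; Table 1's 0.5369 averages the rounded scores
```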

5. A model for discourse segmentation

This section presents our model for segmenting sentences into elementary discourse units. Our model exploits subtree features to rerank the N-best outputs of a base segmenter, which uses syntactic and lexical features in a CRF framework. We first briefly introduce the discriminative reranking method in Section 5.1. We then introduce our base model in Section 5.2 and our method of extracting subtree features for the reranking model in Section 5.3.

5.1. Discriminative reranking

In the discriminative reranking method (Collins & Koo, 2005), a set of candidates is first generated using a base model (GEN). GEN can be any model for the task; in the discourse segmentation task, we exploit a CRF-based model that generates N-best outputs. Candidates are then reranked using a linear score function:

score(y) = Φ(y) · W,

where y is a candidate, Φ(y) is the feature vector of candidate y, and W is a parameter vector. The final output is the candidate with the highest score:

F(x) = argmax_{y ∈ GEN(x)} score(y) = argmax_{y ∈ GEN(x)} Φ(y) · W.

To learn the parameter W, we use the averaged perceptron algorithm, Algorithm 1, where T is the number of iterations.



Fig. 4. Partial lexicalized syntactic parse trees.


Algorithm 1. Averaged perceptron algorithm for reranking (Collins & Koo, 2005)

1:  Inputs: Training set {(xi, yi) | xi ∈ R^n, yi ∈ C, ∀i = 1, 2, ..., m}
2:  Initialize: W ← 0, W_avg ← 0
3:  Define: F(x) = argmax_{y ∈ GEN(x)} Φ(y) · W
4:  for t = 1, 2, ..., T do
5:    for i = 1, 2, ..., m do
6:      zi ← F(xi)
7:      if zi ≠ yi then
8:        W ← W + Φ(yi) − Φ(zi)
9:      end if
10:     W_avg ← W_avg + W
11:   end for
12: end for
13: W_avg ← W_avg / (mT)
14: Output: parameter vector W_avg.
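Algorithm 1 can be sketched compactly as follows; `gen` (the N-best candidate generator) and `features` (the Φ map) are hypothetical stand-ins for the paper's CRF base model and subtree feature extractor.

```python
import numpy as np

def train_avg_perceptron(data, gen, features, dim, T=10):
    """Averaged perceptron for reranking (a sketch of Algorithm 1).

    data: list of (input x, gold candidate y); gen(x): candidate list;
    features(y): feature vector Phi(y) as a NumPy array of length dim.
    """
    W = np.zeros(dim)
    W_avg = np.zeros(dim)
    for _ in range(T):
        for x, y_gold in data:
            # z_i <- F(x_i): highest-scoring candidate under the current W
            z = max(gen(x), key=lambda y: features(y) @ W)
            if z != y_gold:
                W += features(y_gold) - features(z)   # perceptron update
            W_avg += W                                # accumulate for averaging
    return W_avg / (len(data) * T)
```

Averaging the weight vectors over all updates, rather than keeping only the final W, is what gives the method its robustness on held-out data.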

5.2. Base model

Similar to the work of Hernault et al. (2010), our base model uses Conditional Random Fields (CRFs)¹ (Lafferty, McCallum, & Pereira, 2001) to learn a sequence labeling model. Each label is either beginning of EDU (B) or continuation of EDU (C).
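Concretely, the B/C tag sequence for a segmented sentence can be derived as follows (a sketch; the token lists are illustrative):

```python
def edu_labels(edus):
    """Map a sentence segmented into EDUs to its B/C label sequence:
    the first word of each EDU is tagged B, every other word C."""
    labels = []
    for edu in edus:
        labels.append("B")
        labels.extend("C" for _ in edu[1:])
    return labels

print(edu_labels([["Or", "his", "holiday", "has", "come"], ["and", "he", "stayed"]]))
# → ['B', 'C', 'C', 'C', 'C', 'B', 'C', 'C']
```

The CRF is trained to predict this label sequence from the lexical and syntactic features described below; segmenting a new sentence then amounts to cutting it before every predicted B.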

We use the following lexical and syntactic information as features: words, POS tags, and nodes in parse trees together with their lexical heads and POS heads. When extracting features for a word w, let r be the word on the right-hand side of w and Np be the deepest node that belongs to both paths from the root to w and to r. Nw and Nr are the child nodes of Np that belong to the two paths, respectively. Fig. 4 shows two partial lexicalized syntactic parse trees. In the first tree, if w = says then r = it, Np = VP(says), Nw = VBZ(says), and Nr = SBAR(will). We also consider the parent and the right-sibling of Np, if any. The final feature set for w consists not only of features extracted from w, but also of features extracted from the two words on the left-hand side and the two words on the right-hand side of w.

Our feature extraction method is different from the method in previous work (Hernault et al., 2010). They define Nw as the highest ancestor of w that has lexical head w and has a right-sibling; Np and Nr are then defined as the parent and right-sibling of Nw. In the first example, our method gives the same results as the previous one. In the second example, however, there is no node with lexical head "done" that has a right-sibling, so the previous method cannot extract Nw, Np, and Nr in such cases. We also use some new features, such as the head node and the right-sibling node of Np.

5.3. Subtree features for reranking

We need to decide which kinds of subtrees are useful to represent a candidate, i.e., a way to segment the input sentence into EDUs.

1 We use the implementation of Kudo (2012).

In our work, we consider two kinds of subtrees: bound trees and splitting trees.

The bound tree of an EDU, which spans from word u to word w, is a subtree satisfying two conditions:

1. Its root is the deepest node in the parse tree that belongs to both paths from the root of the parse tree to u and to w; and
2. It only contains nodes on those two paths.

The splitting tree between two consecutive EDUs, one from word u to word w and the next from word r to word v, is a subtree similar to a bound tree, but it contains the two paths from the root of the parse tree to w and to r. Bound trees cover whole EDUs, while splitting trees concentrate on the boundaries between EDUs.

From a bound tree (and similarly from a splitting tree), we extract three kinds of subtrees: subtrees on the left path (left trees), subtrees on the right path (right trees), and subtrees consisting of a subtree on the left path and a subtree on the right path (full trees). In the third case, if the subtrees on the left and right paths do not both contain the root node, we add a pseudo root node. Fig. 5 shows the bound tree of the EDU "nothing was done" in the second example of Fig. 4, together with some examples of extracted subtrees.

The feature set of a candidate is the set of all subtrees extracted from the bound trees of all EDUs and the splitting trees between consecutive EDUs.

Table 2 shows the performance of our discourse segmentation system on the test set of RST-DT, compared with some state-of-the-art discourse segmenters. In all experiments, we set N to 20. To choose the number of iterations T, we used a development set of about 20 percent of the training set. Our system achieved 91.5% precision, 90.4% recall, and 91.0% in the F1 score.

6. Experiments

This section describes our experiments on the paraphrase identification task using EDU-based similarities as features for statistical classifiers. Similar to the work of Madnani et al. (2012), we employed MT metrics as the ordered similarity functions. However, we computed the MT metrics based on EDUs in addition to MT metrics based on sentences. In all experiments, parse trees were obtained using the Stanford parser (Klein & Manning, 2003).

6.1. Data and evaluation method

We conducted experiments on the PAN corpus,² a corpus for the paraphrase identification task created from a plagiarism detection corpus (Madnani et al., 2012). Table 3 shows the statistics on the corpus.

2 The corpus can be downloaded at the address: http://bit.ly/mt-para.


Fig. 5. Subtree features.

Table 2. Performance of the discourse segmentation system. Bold values are used to emphasize important points.

System                   Precision (%)   Recall (%)   F1 (%)
Hernault et al. (2010)   91.0            87.2         89.0
Joty et al. (2012)       88.0            92.3         90.1
Our system               91.5            90.4         91.0
Human                    98.5            98.2         98.3

Table 3. PAN corpus for paraphrase identification.

                               Training set   Test set
Number of sentence pairs       10,000         3000
Number of EDUs per sentence    4.31           4.33
Number of words per sentence   40.07          41.12


The corpus includes a training set of 10,000 sentence pairs and a test set of 3000 sentence pairs.³ On average, each sentence contains about 4.3 discourse units, and about 40.1 words in the training set and 41.1 words in the test set. We chose this corpus for the following reasons. First, it is a large corpus for detecting paraphrases. Second, it contains many long sentences; because our method computes similarities based on discourse units, it is suitable for long sentences with several EDUs. Lastly, according to Madnani et al. (2012), the PAN corpus contains many realistic examples of paraphrases.

We evaluated the performance of our paraphrase identification system by accuracy and the F1 score. The accuracy is the percentage of correct predictions over the whole test set, while the F1 score is computed only on the paraphrase sentence pairs.⁴
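A sketch of this evaluation (accuracy over all pairs; F1 restricted to the +1, i.e., paraphrase, class as described in footnote 4):

```python
def evaluate(gold, pred):
    """Accuracy over all pairs; F1 on the paraphrase (+1) class only."""
    acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    prec = tp / max(1, sum(p == 1 for p in pred))
    rec = tp / max(1, sum(g == 1 for g in gold))
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, f1

print(evaluate([1, 1, -1, -1], [1, -1, -1, 1]))  # (0.5, 0.5)
```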

6.2. MT metrics

We investigated our method with six different MT metrics (six types of ordered similarity functions). These metrics have been shown to be effective for the task of paraphrase identification (Madnani et al., 2012).

1. BLEU (Papineni et al., 2002) is the most commonly used MT metric. It computes the amount of n-gram overlap between a hypothesis text (the output of a translation system) and a reference text.

2. NIST (Doddington, 2002) is a variant of BLEU using the arithmetic mean of n-gram overlaps. Both BLEU and NIST use exact matching; they have no concept of synonymy or paraphrasing.

3. TER (Snover et al., 2006) computes the number of edits needed to ‘‘fix’’ the hypothesis text so that it matches the reference text.

3 The training set and the test set were divided exactly the same as in the work of Madnani et al. (2012).

4 If we consider each sentence pair as an instance with label +1 for paraphrase and label −1 for non-paraphrase, the reported F1 score was the F1 score on label +1.

4. TERP (Snover et al., 2009), or TER-Plus, is an extension of TER that utilizes phrasal substitutions, stemming, synonyms, and other improvements.

5. METEOR (Denkowski & Lavie, 2010) is based on the harmonic mean of unigram precision and recall. It also incorporates stemming, synonymy, and paraphrasing.

6. BADGER (Parker, 2008), a language-independent metric, computes a compression distance between two sentences using the Burrows–Wheeler Transformation (BWT).
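To make the n-gram overlap idea behind BLEU concrete, here is a toy clipped n-gram precision over token lists. This is only an illustrative sketch of the overlap computation, not the official BLEU implementation (which combines several n-gram orders and a brevity penalty):

```python
from collections import Counter

def ngram_precision(hyp, ref, n=2):
    """Toy clipped n-gram precision in the spirit of BLEU.

    hyp, ref: token lists. Each hypothesis n-gram counts only up to the
    number of times it appears in the reference (clipping)."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    if not hyp_ngrams:
        return 0.0
    clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return clipped / sum(hyp_ngrams.values())
```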

Among the six MT metrics, TER and TERP compute a translation error rate between a hypothesis text and a reference text. Therefore, the smaller these metrics are, the more similar the two texts are. When using these metrics to compute EDU-based similarities, we replaced the max function in Eq. (6) with a min function:

\[
\mathrm{Sim}(s_1, s_2) = \frac{\mathrm{Sim}_{\mathrm{ordered}}(s_1, s_2) + \mathrm{Sim}_{\mathrm{ordered}}(s_2, s_1)}{2}
= \frac{1}{2} \sum_{i=1}^{m} \frac{|e_i|}{|s_1|} \min_{j=1}^{n} \mathrm{Sim}_{\mathrm{ordered}}(e_i, f_j)
+ \frac{1}{2} \sum_{j=1}^{n} \frac{|f_j|}{|s_2|} \min_{i=1}^{m} \mathrm{Sim}_{\mathrm{ordered}}(f_j, e_i).
\]
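A plain sketch of the symmetric EDU-based similarity may help make the weighting explicit. The function below assumes EDUs are given as token lists and takes the ordered similarity function as a parameter; passing `aggregate=min` reproduces the TER/TERP variant discussed above, while `aggregate=max` corresponds to the similarity-style metrics. Function and parameter names are our own:

```python
def edu_similarity(s1_edus, s2_edus, sim, aggregate=max):
    """Symmetric EDU-based similarity between two sentences.

    s1_edus, s2_edus: lists of EDUs, each EDU a list of tokens.
    sim: an ordered similarity function over two token sequences
         (e.g. an MT metric).
    aggregate: max for similarity metrics, min for error-rate metrics
               such as TER/TERP."""
    len1 = sum(len(e) for e in s1_edus)  # |s1| in words
    len2 = sum(len(f) for f in s2_edus)  # |s2| in words
    # Each EDU is weighted by its share of the sentence length.
    part1 = sum(len(e) / len1 * aggregate(sim(e, f) for f in s2_edus)
                for e in s1_edus)
    part2 = sum(len(f) / len2 * aggregate(sim(f, e) for e in s1_edus)
                for f in s2_edus)
    return 0.5 * part1 + 0.5 * part2
```

With a trivial word-overlap `sim`, identical EDUs contribute their full length share while unmatched EDUs contribute nothing.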

6.3. Support Vector Machines (SVMs)

This subsection gives a brief introduction to Support Vector Machines (SVMs), and explains why we used SVMs to build binary classifiers for identifying paraphrases.

Support Vector Machines (SVMs) are a statistical machine learning technique proposed by Vapnik (1998). To choose a hyperplane separating samples in a classification task, SVMs use a strategy that maximizes the margin between training samples and the hyperplane. In cases where we cannot separate training samples linearly (because of some noise in the training data, for example), we can build the separating hyperplane by allowing some misclassifications. In those cases, we can build an optimal hyperplane by introducing a soft margin parameter, which trades off between the training error and the magnitude of the margin (Mohri, Rostamizadeh, & Talwalkar, 2012; Vapnik, 1998).

SVMs can also deal with non-linear classification problems. First, the optimization problem is rewritten into a dual form, in which feature vectors only appear in the form of their dot products. By introducing a kernel function K(x_i, x_j) to substitute the dot product of x_i and x_j in the dual form, SVMs can solve non-linear cases. Here are some kinds of kernel functions which are commonly used:

- Linear: K(x_i, x_j) = x_i · x_j,
- Polynomial: K(x_i, x_j) = (γ x_i · x_j + r)^d, γ > 0,
- Radial basis function (RBF): K(x_i, x_j) = exp(−γ ‖x_i − x_j‖²), γ > 0,
- Sigmoid: K(x_i, x_j) = tanh(γ x_i · x_j + r).

Here, γ, r, and d are kernel parameters.
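The four kernels can be sketched directly from their definitions; these short functions are illustrations of the formulas (an SVM library such as LIBSVM computes them internally), with default parameter values chosen arbitrarily for the example:

```python
import math

def linear(xi, xj):
    """Linear kernel: the plain dot product."""
    return sum(a * b for a, b in zip(xi, xj))

def polynomial(xi, xj, gamma=1.0, r=0.0, d=2):
    """Polynomial kernel: (gamma * <xi, xj> + r) ** d."""
    return (gamma * linear(xi, xj) + r) ** d

def rbf(xi, xj, gamma=0.5):
    """RBF kernel: exp(-gamma * ||xi - xj||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj)))

def sigmoid(xi, xj, gamma=1.0, r=0.0):
    """Sigmoid kernel: tanh(gamma * <xi, xj> + r)."""
    return math.tanh(gamma * linear(xi, xj) + r)
```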


Table 4
Experimental results on each individual MT metric.

              Sentence-based             + EDU-based
MT metric     Accuracy (%)   F1 (%)      Accuracy (%)   F1 (%)
BLEU (1-4)    89.0           88.4        89.6 (+0.6)    89.1 (+0.7)
NIST (1-5)    84.6           82.7        87.6 (+3.0)    86.8 (+4.1)
TER           88.2           87.3        88.5 (+0.3)    87.7 (+0.4)
TERP          91.0           90.6        91.1 (+0.1)    90.8 (+0.2)
METEOR        90.0           89.6        89.8 (−0.2)    89.4 (−0.2)
BADGER        88.1           87.8        88.2 (+0.1)    87.8 (–)

Table 5
Experimental results on combined MT metrics. Bold values are used to emphasize important points.

MT metrics                   Accuracy (%)   F1 (%)
BLEU                         89.6           89.1
BLEU + NIST                  91.2           90.9
BLEU + NIST + TER            91.8           91.6
BLEU + NIST + TER + TERP     93.1           93.0
Madnani-4                    91.5           91.2
Madnani-6                    92.3           92.1

Table 6
Experimental results on long and short sentences. Bold values are used to emphasize important points.

Subset    #Sent pairs   #EDUs/sent   #Words/sent   Acc. (%)   F1 (%)
Subset1   1317          6.5          56.6          96.6       94.8
Subset2   1683          2.6          27.2          90.4       92.3


SVMs have demonstrated their performance on a number of problems in areas including computer vision, handwriting recognition, pattern recognition, and statistical natural language processing. In the field of natural language processing, SVMs have been applied to many tasks, including machine translation (Zhang & Li, 2009), topic classification (Wang & Manning, 2012), information extraction (Boella & Di Caro, 2013), sentiment analysis (Rushdi Saleh, Martín-Valdivia, Montejo-Ráez, & Ureña López, 2011; Proisl, Greiner, Evert, & Kabashi, 2013), and discourse parsing (Hernault et al., 2010), and achieved very good results. In fact, SVMs have also been exploited successfully to identify paraphrases (Finch et al., 2005; Madnani et al., 2012; Mihalcea et al., 2006; Wan et al., 2006).

6.4. Experimental results with a single SVM classifier

In all experiments presented in this subsection, we chose SVMs (Vapnik, 1998) as the learning method to train a single binary classifier.5

First, we investigated each individual MT metric. To see the contributions of EDU-based similarities, we conducted experiments in two settings. In the first setting, we directly applied the MT metric to pairs of sentences to get the similarities (sentence-based similarities). In the second setting, we computed EDU-based similarities in addition to the sentence-based similarities. Similar to the work of Madnani et al. (2012), in our experiments, we used BLEU1 through BLEU4 as 4 different features and NIST1 through NIST5 as 5 different features.6 Table 4 shows experimental results in the two settings on the PAN corpus. We can see that adding EDU-based similarities improved the performance of the paraphrase identification system with most of the MT metrics, especially with NIST (3.0%), BLEU (0.6%), and TER (0.3%).

Table 5 shows experimental results with multiple MT metrics on the PAN corpus. With each MT metric, we computed the similarities with both methods, based directly on sentences and based on discourse units. We gradually added MT metrics one by one to the system. After adding the TERP metric, we achieved 93.1% accuracy and 93.0% in the F1 score. Adding two more MT metrics, METEOR and BADGER, did not improve the performance.

The last two rows of Table 5 show the results of Madnani et al. (2012) when using 4 MT metrics, including BLEU, NIST, TER, and TERP (Madnani-4), and when using all 6 MT metrics (Madnani-6).7 Compared with the best previous results, our method improves accuracy by 0.8% and the F1 score by 0.9%. It yields a 10.4% error rate reduction.

We also investigated our method on long and short sentences. We divided the sentence pairs in the test set into two subsets: Subset1 (long sentences) contains sentence pairs in which both sentences have

5 We conducted experiments with the LIBSVM tool (Chang & Lin, 2011) with the RBF kernel.

6 BLEUn and NISTn use n-grams.
7 Madnani et al. (2012) show that adding more MT metrics does not improve the performance of the paraphrase identification system.

at least 4 discourse units,8 and Subset2 (short sentences) contains the other sentence pairs. Table 6 shows the information and experimental results on the two subsets. Subset1 consists of 1,317 sentence pairs (on average, 6.5 EDUs and 56.6 words per sentence), while Subset2 consists of 1,683 sentence pairs (on average, 2.6 EDUs and 27.2 words per sentence). We can see that our method was especially effective for the long sentences, on which we achieved 96.6% accuracy and 94.8% in the F1 score, compared with 90.4% accuracy and 92.3% in the F1 score on the short sentences.
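The long/short split described above can be sketched as follows. This is our own illustrative helper, assuming each test pair is represented as a tuple of two EDU lists; a pair is "long" when both sentences have at least the threshold number of EDUs:

```python
def split_by_edu_count(pairs, min_edus=4):
    """Split sentence pairs into (long, short) subsets by EDU count.

    pairs: iterable of (edus1, edus2), where each edusK is a list of EDUs.
    A pair is long when BOTH sentences have at least min_edus EDUs."""
    long_pairs, short_pairs = [], []
    for edus1, edus2 in pairs:
        if len(edus1) >= min_edus and len(edus2) >= min_edus:
            long_pairs.append((edus1, edus2))
        else:
            short_pairs.append((edus1, edus2))
    return long_pairs, short_pairs
```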

6.5. Revision learning and voting

In Subsection 6.4, we presented experiments that combine several MT metrics into a single SVM classifier. In this subsection, we investigate the combination of several SVM classifiers (ensemble models) built on individual MT metrics for the paraphrase identification task. We present experiments with a revision learning model and a maximal voting model. Revision learning and voting are popular and simple, but also powerful, methods to produce ensemble models. They have been shown to be effective in a number of NLP problems, including word alignment (Wu & Wang, 2005), dependency parsing (Attardi & Ciaramita, 2007; Surdeanu & Manning, 2010), named entity translation (Tang, Geva, Trotman, & Xu, 2010), and information extraction (Thomas, Neves, Rocktäschel, & Leser, 2013).

The main idea of the two models can be expressed as follows.

1. We first build several classifiers (base models) to identify paraphrases using normal features (MT metrics).

2. We then build a final classifier (revision model or voting model) to judge the paraphrase relation based on the outputs of the base models in the first step.

In our experiments, we built seven base models using SVMs. The first six models employed the six MT metrics (BLEU, NIST, TER, TERP, METEOR, and BADGER) as features, respectively. The last model used the best combination of MT metrics, including BLEU, NIST, TER, and TERP. For each MT metric, we computed two types of similarities, sentence-based similarity and EDU-based similarity.

Our revision learning model is illustrated in Fig. 6. The revision learning model was also trained using SVMs, with features being the probabilities with which the base models judge the sentence pair to be a paraphrase or not. Each base model contributes two features (the probability of paraphrase and the probability of non-paraphrase), which yield

8 The number 4 was chosen because, on average, each sentence contains about 4 EDUs (see Table 3).


Fig. 6. A revision learning model for the paraphrase identification task.


fourteen features in total. To create training data for the revision model, we used a development set, which is about 20% of the training set.
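The fourteen-feature construction for the revision classifier can be sketched as below. The `predict_proba(pair) -> (p_para, p_non_para)` method on each base model is a hypothetical interface of our own, standing in for whatever probability output the SVM base models expose:

```python
def revision_features(pair, base_models):
    """Build the feature vector for the revision classifier.

    Each of the seven base models contributes two features for the pair:
    P(paraphrase) and P(non-paraphrase), giving 14 features in total.
    base_models: objects with a predict_proba(pair) -> (p_para, p_non)
    method (hypothetical API)."""
    features = []
    for model in base_models:
        p_para, p_non = model.predict_proba(pair)
        features.extend([p_para, p_non])
    return features
```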

Algorithm 2 describes our voting model. The voting model first picks the model that produces the highest probability among the seven base models. If that probability is higher than a threshold9 (confidence score), the output of that model is selected as the output of the voting model; otherwise, the voting model selects the output of the best base model (the seventh base model with combined MT metrics) as its output. The intuition is that if none of the base models gives a confident result, the best base model is a reasonable choice.

Algorithm 2. A voting algorithm for the paraphrase identification task

1: Input:
   - A sentence pair
   - Seven base models
   - A threshold T
2: Output: Yes (in the case of paraphrase), No (in the case of non-paraphrase)
3: Predict label for the sentence pair using the base models
4: Select the base model (called BM) producing the highest probability (prob)
5: if prob ≥ T then
6:   Return the output of BM
7: else
8:   Return the output of the best base model (using combined MT metrics)
9: end if
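The voting procedure above admits a direct sketch. Here each base model is assumed to be a callable returning a `(label, probability)` pair (our own convention, not the paper's implementation), and the best base model is passed separately:

```python
def vote(pair, base_models, best_model, threshold=0.95):
    """Maximal-voting sketch of Algorithm 2.

    base_models: callables returning (label, probability) for the pair.
    best_model: the base model trained on combined MT metrics, used as
    the fallback when no base model is confident enough."""
    outputs = [model(pair) for model in base_models]
    label, prob = max(outputs, key=lambda out: out[1])
    if prob >= threshold:          # a confident base model wins
        return label
    return best_model(pair)[0]     # otherwise fall back to the best model
```

The 0.95 default mirrors the threshold reported in the footnote; in practice it would be tuned on a development set.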

Table 7 shows experimental results of the revision learning model and the voting model on the PAN corpus. Our revision learning model achieved 93.2% accuracy and 93.1% in the F1 score, which slightly improved on the best base model with combined MT metrics. The voting model achieved the best results with 93.4% accuracy and 93.3% in the F1 score, which improved 0.3% (in both accuracy and the F1 score) compared with the best base model, and 1.1% accuracy and 1.2% in the F1 score compared with the previous work of Madnani et al. (2012).

7. Error analysis

This section identifies the causes of the errors that our method made on the test data of the PAN corpus, which includes 3,000 sentence pairs. First, we wanted to know the statistics of

9 The threshold was set by using a development set, which is about 20% of the training set. It was 0.95 in our experiments.

the experimental results on the test data. We considered the following questions:

1. With each sentence pair in the test set, how many models among the seven base models produced a correct output?

2. How many sentence pairs were predicted correctly by at least one base model? And therefore, how many sentence pairs were unable to be predicted correctly by any base model?

Table 8 shows statistics of the experimental results on the test set. Among the 3,000 sentence pairs, 2,344 sentence pairs (78.1%) were predicted correctly by all seven base models, 226 sentence pairs (7.5%) were predicted correctly by six base models, and only 89 sentence pairs (3.0%) were unable to be predicted correctly by any base model. There are 2,911 sentence pairs (97%) that were predicted correctly by at least one base model. The upper bound of our method can therefore be considered to be 97%.

7.1. Paraphrases (predicted as non-paraphrases)

This subsection shows three main types of errors in which paraphrase sentence pairs were predicted as non-paraphrases:

1. Complex Sentential Paraphrases
These sentence pairs are real-world paraphrases, where the paraphrase sentences are produced by making several complex transformations and using many new words. Considering the two following sentence pairs, in the first case the paraphrase sentence is even totally rewritten.
- ‘‘‘‘Sukey will be good to him,’’ said Mrs. Lawton, in tones more gentle than usual.’’ and ‘‘Was it her imagination, or did Mrs. Lawton’s eyes look shifty?’’
- ‘‘A rich man named Fintan was childless, for his wife was barren for many years.’’ and ‘‘Wealthy fellow, Fintan, had an infertile wife, so their marriage was a childless one.’’

2. Idioms
These sentence pairs use idioms that make the meaning very difficult to understand and therefore difficult to judge the paraphrase relation. Below is such an example.
- ‘‘Such an artist, by the very nature of his endeavors, must needs stand above all public clapper-clawing, pro or con.’’ and ‘‘A true artist must never try to please patrons, clients, or colleagues but must work on his own inspiration and stand apart from the public’s praise or contempt.’’

3. Typographical and Spelling Errors
The PAN corpus includes sentence pairs containing typos and spelling errors that prevent the system from judging correctly. Below is such an example.
- ‘‘If I could only git him to move I’d be happier jest ter foller him.’’ and ‘‘But still, I would follow him if he ever chose to move on.’’


Table 7
Experimental results of the revision learning model and the voting model. Bold values are used to emphasize important points.

Model                                       Accuracy (%)   F1 (%)
Madnani et al. (2012)                       92.3           92.1
The best base model (combined MT metrics)   93.1 (+0.8)    93.0 (+0.9)
Revision learning                           93.2 (+0.9)    93.1 (+1.0)
Voting                                      93.4 (+1.1)    93.3 (+1.2)

Table 8
Statistics of the experimental results on the test set. Bold values are used to emphasize important points.

Number of base model(s) predicted correctly   Number of sentence pairs   Percentage
7                                             2344                       78.1
6                                             226                        7.5
5                                             89                         3.0
4                                             85                         2.8
3                                             42                         1.4
2                                             66                         2.2
1                                             59                         2.0
0                                             89                         3.0


7.2. Non-paraphrases (predicted as paraphrases)

This subsection shows two main types of errors in which non-paraphrase sentence pairs were predicted as paraphrases:

1. Misleading Lexical Overlap
These sentence pairs consist of two sentences which have large lexical overlap. They share a lot of words and contain only a few different words. However, those few different words change the meaning. Here are some examples.
- ‘‘For catching doves, and other current game, they had ingenious little traps.’’ and ‘‘For catching doves, and other small game, they had ingenious romantic journeyings.’’
- ‘‘Drawn by Boudier, from a photograph by M. de Morgan.’’ and ‘‘Drawn by Boudier, from a photograph by M. Binder.’’

2. Containing
These sentence pairs consist of two sentences in which one contains the other but has additional parts. Here is such an example.
- ‘‘His dinner had been put back half an hour!’’ and ‘‘The end of all things was at hand; his dinner had been put back half an hour!’’

8. Conclusion and future work

We have presented a study on exploiting discourse information to identify paraphrases. By introducing a new method for computing text similarity based on discourse units, we showed that discourse structure provides important information for paraphrase identification. Unlike previous work, our method is not limited to any kind of text. The main contributions of our work can be summarized in the following points:

1. We presented the first work on the relations between discourse units and paraphrasing, in which discourse units play an important role in paraphrasing. We also showed that, in many cases, paraphrase sentences can be generated from the original sentences by conducting several transformations.

2. We presented EDU-based similarity, a new method for computing the similarity between two sentences based on elementary discourse units. Our method is general and not limited to any kind of text.

3. We applied the method to the task of paraphrase identification.
4. We achieved 93.4% accuracy in experiments conducted on the PAN corpus, which improved 1.1% compared with the previous work of Madnani et al. (2012). Experimental results showed that discourse information was effective for the task of identifying paraphrases.

To the best of our knowledge, this is the first work to employ discourse units for computing similarity as well as for identifying paraphrases. Although our method was proposed for computing the similarity between two sentences, it can also be used to compute the similarity between two arbitrary texts.

Our work will be beneficial for building many natural language applications, including paraphrase generation, text summarization, question answering, and machine translation. A general method for paraphrase generation is to use machine translation systems to generate a set of paraphrase sentence pairs; a paraphrase identification system is then exploited to select the best paraphrases. Paraphrase generation can be used to rewrite the question in a question answering system, or to generate reference sentences for evaluating translation quality. Paraphrase identification can also help a text summarization system avoid adding redundant information to a summary. In a multi-document summarization system, similar sentences in different documents are grouped into a cluster, and important sentences in clusters are extracted to form a summary. Each cluster may contain several sentences which convey the same information; a paraphrase identification system can be exploited to remove such redundant sentences. Because our method exploits RST discourse structures, it is not limited to any specific kind of text. It is also language-independent and requires only MT metric tools.

There are several ways to extend the current work. First, we would like to investigate our method on other datasets for the paraphrase identification task, as well as on related tasks such as recognizing textual entailment (Bentivogli, Dagan, Dang, Giampiccolo, & Magnini, 2009) and semantic textual similarity (Agirre, Cer, Diab, & Gonzalez-Agirre, 2012). Second, we plan to exploit the roles of discourse units to improve the method of computing text similarity. In this work, the importance of discourse units is calculated based only on their length (i.e., number of words). However, the importance of a discourse unit also depends on its role in the text and its relations with other units. Exploiting the relations between discourse units for computing similarity may be an interesting direction for further research.

References

Agirre, E., Cer, D., Diab, M., & Gonzalez-Agirre, A. (2012). SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the sixth international workshop on semantic evaluation (SemEval) (pp. 385–393).

Attardi, G., & Ciaramita, M. (2007). Tree revision learning for dependency parsing. In Proceedings of the annual conference of the North American chapter of the association for computational linguistics (NAACL) (pp. 388–395).

Bach, N. X., Minh, N. L., & Shimazu, A. (2012a). A reranking model for discourse segmentation using subtree features. In Proceedings of the 13th annual meeting of the special interest group on discourse and dialogue (SIGDIAL) (pp. 160–168).

Bach, N. X., Minh, N. L., & Shimazu, A. (2012b). UDRST: A novel system for unlabeled discourse parsing in the RST framework. In Proceedings of the eighth international conference on natural language processing (JapTAL) (pp. 250–261).

Barzilay, R., McKeown, K. R., & Elhadad, M. (1999). Information fusion in the context of multi-document summarization. In Proceedings of the 37th annual meeting of the association for computational linguistics (ACL) (pp. 550–557).

Bentivogli, L., Dagan, I., Dang, H. T., Giampiccolo, D., & Magnini, B. (2009). The fifth PASCAL recognizing textual entailment challenge. In Proceedings of the text analysis conference (TAC).

Boella, G., & Di Caro, L. (2013). Extracting definitions and hypernym relations relying on syntactic dependencies and support vector machines. In Proceedings of the 51st annual meeting of the association for computational linguistics (ACL) (pp. 532–537).

Callison-Burch, C., Koehn, P., & Osborne, M. (2006). Improved statistical machine translation using paraphrases. In Proceedings of the human language technology conference of the NAACL (HLT-NAACL) (pp. 17–24).



Carlson, L., Marcu, D., & Okurowski, M. E. (2002). RST discourse treebank.

Chan, Y. S., & Ng, H. T. (2008). MAXSIM: A maximum similarity metric for machine translation evaluation. In Proceedings of the 46th annual meeting of the association for computational linguistics: Human language technology (ACL-HLT) (pp. 55–62).

Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 27:1–27:27.

Collins, M., & Koo, T. (2005). Discriminative reranking for natural language parsing. Computational Linguistics, 31(1), 25–70.

Das, D., & Smith, N. A. (2009). Paraphrase identification as probabilistic quasi-synchronous recognition. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the fourth international joint conference on natural language processing of the AFNLP (ACL-IJCNLP) (pp. 468–476).

Deléger, L., & Zweigenbaum, P. (2009). Extracting lay paraphrases of specialized expressions from monolingual comparable medical corpora. In Proceedings of the second workshop on building and using comparable corpora: From parallel to non-parallel corpora (pp. 2–10).

Denkowski, M., & Lavie, M. (2010). Extending the METEOR machine translation metric to the phrase level. In Proceedings of the 2010 annual conference of the North American chapter of the association for computational linguistics (NAACL) (pp. 250–253).

Doddington, G. (2002). Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the second international conference on human language technology research (pp. 138–145).

Duboue, P. A., & Chu-Carroll, J. (2006). Answering the question you wish they had asked: The impact of paraphrasing for question answering. In Proceedings of the human language technology conference of the NAACL (HLT-NAACL) (pp. 33–36).

Fernando, S., & Stevenson, M. (2008). A semantic similarity approach to paraphrase detection. In Proceedings of the computational linguistics UK (CLUK).

Finch, A., Hwang, Y. S., & Sumita, E. (2005). Using machine translation evaluation techniques to determine sentence-level semantic equivalence. In Proceedings of the third international workshop on paraphrasing (pp. 17–24).

Ganitkevitch, J., Callison-Burch, C., Napoles, C., & Van Durme, B. (2011). Learning sentential paraphrases from bilingual parallel corpora for text-to-text generation. In Proceedings of the conference on empirical methods in natural language processing (EMNLP) (pp. 1168–1179).

Habash, N., & Kholy, A. E. (2008). SEPIA: Surface span extension to syntactic dependency precision-based MT evaluation. In Proceedings of the workshop on metrics for machine translation at AMTA.

Hernault, H., Bollegala, D., & Ishizuka, M. (2010). A sequential model for discourse segmentation. In Proceedings of the 11th international conference on intelligent text processing and computational linguistics (CICLING) (pp. 315–326).

Joty, S., Carenini, G., & Ng, R. T. (2012). A novel discriminative framework for sentence-level discourse analysis. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CONLL) (pp. 904–915).

Joty, S., Carenini, G., Ng, R., & Mehdad, Y. (2013). Combining intra- and multi-sentential rhetorical parsing for document-level discourse analysis. In Proceedings of the 51st annual meeting of the association for computational linguistics (ACL) (pp. 486–496).

Klein, D., & Manning, C. (2003). Accurate unlexicalized parsing. In Proceedings of the 41st annual meeting of the association for computational linguistics (ACL) (pp. 423–430).

Kozareva, Z., & Montoyo, A. (2006). Paraphrase identification on the basis of supervised machine learning techniques. In Proceedings of the fifth international conference on natural language processing (FinTAL) (pp. 524–533).

Kudo, T. (2012). CRF++: Yet another CRF toolkit. <http://crfpp.sourceforge.net/>.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the eighteenth international conference on machine learning (ICML) (pp. 282–289).

Leusch, G., Ueffing, N., & Ney, H. (2003). A novel string-to-string distance measure with applications to machine translation evaluation. In Proceedings of the ninth machine translation summit (MT Summit IX).

Madnani, N., Tetreault, J., & Chodorow, M. (2012). Re-examining machine translation metrics for paraphrase identification. In Proceedings of the 2012 conference of the North American chapter of the association for computational linguistics: Human language technologies (NAACL-HLT) (pp. 182–190).

Mann, W. C., & Thompson, S. A. (1988). Rhetorical structure theory: Toward a functional theory of text organization (pp. 243–281).

Mihalcea, R., Corley, C., & Strapparava, C. (2006). Corpus-based and knowledge-based measures of text semantic similarity. In Proceedings of the twenty-first national conference on artificial intelligence (AAAI) (pp. 775–780).

Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2012). Foundations of machine learning. MIT Press.

Niessen, S., Och, F., Leusch, G., & Ney, H. (2000). An evaluation tool for machine translation: Fast evaluation for MT research. In Proceedings of the second international conference on language resources and evaluation (LREC).

Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the association for computational linguistics (ACL) (pp. 311–318).

Parker, S. (2008). BADGER: A new machine translation metric. In Proceedings of the workshop on metrics for machine translation at AMTA.

Proisl, T., Greiner, P., Evert, S., & Kabashi, B. (2013). KLUE: Simple and robust methods for polarity classification. In Proceedings of the seventh international workshop on semantic evaluation (SemEval 2013) (pp. 395–401).

Regneri, M., & Wang, R. (2012). Using discourse information for paraphrase extraction. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CONLL) (pp. 916–927).

Rus, V., McCarthy, P. M., Lintean, M. C., McNamara, D. S., & Graesser, A. C. (2008). Paraphrase identification with lexico-syntactic graph subsumption. In Proceedings of the twenty-first international Florida artificial intelligence research society conference (FLAIRS) (pp. 201–206).

Rushdi Saleh, M., Martín-Valdivia, M. T., Montejo-Ráez, A., & Ureña López, L. A. (2011). Experiments with SVM to classify opinions in different domains. Expert Systems with Applications, 38(12), 14799–14804.

Snover, M., Dorr, B., Schwartz, R., Micciulla, L., & Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of the conference of the association for machine translation in the Americas (AMTA).

Snover, M., Madnani, N., Dorr, B., & Schwartz, R. (2009). TER-Plus: Paraphrase, semantic, and alignment enhancements to translation edit rate. Machine Translation, 23(2–3), 117–127.

Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., & Manning, C. D. (2011). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in neural information processing systems 24 (NIPS) (pp. 801–809).

Surdeanu, M., & Manning, C. (2010). Ensemble models for dependency parsing: Cheap and good? In Proceedings of the 2010 annual conference of the North American chapter of the association for computational linguistics (NAACL) (pp. 649–652).

Tang, L., Geva, S., Trotman, A., & Xu, Y. (2010). A voting mechanism for named entity translation in English-Chinese question answering. In Proceedings of the 4th international workshop on cross lingual information access at COLING 2010 (pp. 43–51).

Thomas, P., Neves, M., Rocktäschel, T., & Leser, U. (2013). WBI-DDI: Drug-drug interaction extraction using majority voting. In Proceedings of the seventh international workshop on semantic evaluation (SemEval 2013) (pp. 628–635).

Uzuner, O., Katz, B., & Nahnsen, T. (2005). Using syntactic information to identify plagiarism. In Proceedings of the second workshop on building educational applications using natural language processing (pp. 37–44).

Vapnik, V. N. (1998). Statistical learning theory. Wiley-Interscience.

Wan, S., Dras, M., Dale, R., & Paris, C. (2006). Using dependency-based features to take the para-farce out of paraphrase. In Proceedings of the 2006 Australasian language technology workshop (pp. 131–138).

Wang, S., & Manning, C. (2012). Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th annual meeting of the association for computational linguistics (ACL) (pp. 90–94).

Wu, H., & Wang, H. (2005). Improving statistical word alignment with ensemble methods. In Proceedings of the international joint conference on natural language processing (IJCNLP) (pp. 462–473).

Zhang, M., & Li, H. (2009). Tree kernel-based SVM with structured syntactic knowledge for BTG-based phrase reordering. In Proceedings of the conference on empirical methods in natural language processing (EMNLP) (pp. 698–707).

Zhu, C., Byrd, R. H., Lu, P., & Nocedal, J. (1997). Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4), 550–560.