
Extracting Lexically Divergent Paraphrases from Twitter

Wei Xu¹, Alan Ritter², Chris Callison-Burch¹, William B. Dolan³ and Yangfeng Ji⁴

¹ University of Pennsylvania, Philadelphia, PA, USA  {xwe, ccb}@cis.upenn.edu

² The Ohio State University, Columbus, OH, USA  [email protected]

³ Microsoft Research, Redmond, WA, USA  [email protected]

⁴ Georgia Institute of Technology, Atlanta, GA, USA  [email protected]

Abstract

We present MULTIP (Multi-instance Learning Paraphrase Model), a new model suited to identifying paraphrases within the short messages on Twitter. We jointly model paraphrase relations between word and sentence pairs and assume only sentence-level annotations during learning. Using this principled latent variable model alone, we achieve performance competitive with a state-of-the-art method that combines a latent space model with a feature-based supervised classifier. Our model also captures lexically divergent paraphrases that differ from yet complement previous methods; combining our model with previous work significantly outperforms the state-of-the-art. In addition, we present a novel annotation methodology that has allowed us to crowdsource a paraphrase corpus from Twitter. We make this new dataset available to the research community.

1 Introduction

Paraphrases are alternative linguistic expressions of the same or similar meaning (Bhagat and Hovy, 2013). Twitter engages millions of users, who naturally talk about the same topics simultaneously and frequently convey similar meaning using diverse linguistic expressions. The unique characteristics of this user-generated text present new challenges and opportunities for paraphrase research (Xu et al., 2013b; Wang et al., 2013). For many applications, like automatic summarization, first story detection (Petrovic et al., 2012) and search (Zanzotto et al., 2011), it is crucial to resolve redundancy in tweets (e.g. oscar nom'd doc ↔ Oscar-nominated documentary).

In this paper, we investigate the task of determining whether two tweets are paraphrases. Previous work has exploited a pair of shared named entities to locate semantically equivalent patterns from related news articles (Shinyama et al., 2002; Sekine, 2005; Zhang and Weld, 2013). But short sentences in Twitter do not often mention two named entities (Ritter et al., 2012) and require nontrivial generalization from named entities to other words. For example, consider the following two sentences about basketball player Brook Lopez from Twitter:

◦ That boy Brook Lopez with a deep 3
◦ brook lopez hit a 3 and i missed it

Although these sentences do not have many words in common, the identical word "3" is a strong indicator that the two sentences are paraphrases.

We therefore propose a novel joint word-sentence approach, incorporating a multi-instance learning assumption (Dietterich et al., 1997) that two sentences under the same topic (we highlight topics in bold) are paraphrases if they contain at least one word pair (we call it an anchor and highlight it with underscores; the words in the anchor pair need not be identical) that is indicative of sentential paraphrase. This at-least-one-anchor assumption might be ineffective for long or randomly paired sentences, but holds up better for short sentences that are temporally and topically related on Twitter. Moreover, our model design (see Figure 1) allows exploitation of arbitrary features and linguistic resources, such as part-of-speech features and a normalization lexicon, to discriminatively determine whether word pairs are paraphrastic anchors.

Transactions of the Association for Computational Linguistics, 2 (2014) 435–448. Action Editor: Sharon Goldwater. Submitted 8/2014; Revised 10/2014; Published 10/2014. © 2014 Association for Computational Linguistics.

Figure 1: (a) a plate representation of the MULTIP model; (b) an example instantiation of MULTIP for the pair of sentences "Manti bout to be the next Junior Seau" and "Teo is the little new Junior Seau", in which a new American football player, Manti Te'o, was being compared to the famous former player Junior Seau. Only 4 out of the total 6 × 5 word pairs, z1–z30, are shown here.

Our graphical model is a major departure from popular surface- or latent-similarity methods (Wan et al., 2006; Guo and Diab, 2012; Ji and Eisenstein, 2013, and others). Our approach to extracting paraphrases from Twitter is general and can be combined with various topic detection solutions. As a demonstration, we use Twitter's own trending topic service¹ to collect data and conduct experiments. While having a principled and extensible design, our model alone achieves performance on par with a state-of-the-art ensemble approach that involves both latent semantic modeling and supervised classification. The proposed model also captures radically different paraphrases from previous approaches; a combined system shows significant improvement over the state-of-the-art.

This paper makes the following contributions:

1) We present a novel latent variable model for paraphrase identification that specifically accommodates the very short context and divergent wording in Twitter data. We experimentally compare several representative approaches and show that our proposed method yields state-of-the-art results and identifies paraphrases that are complementary to previous methods.

¹ More information about Twitter's trends: https://support.twitter.com/articles/101125-faqs-about-twitter-s-trends

2) We develop an efficient crowdsourcing method and construct a Twitter Paraphrase Corpus of about 18,000 sentence pairs, as a first common testbed for the development and comparison of paraphrase identification and semantic similarity systems. We make this dataset available to the research community.²

2 Joint Word-Sentence Paraphrase Model

We present a new latent variable model that jointly captures paraphrase relations between sentence pairs and word pairs. It is very different from previous approaches in that its primary design goal and motivation is targeted towards short, lexically diverse text on the social web.

2.1 At-least-one-anchor Assumption

Much previous work on paraphrase identification has been developed and evaluated on a specific benchmark dataset, the Microsoft Research Paraphrase Corpus (Dolan et al., 2004), which is derived from news articles.

² The dataset and code are made available at the SemEval-2015 shared task http://alt.qcri.org/semeval2015/task1/ and https://github.com/cocoxu/twitterparaphrase/


Corpus                      Examples

News                        ◦ Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier.
(Dolan and Brockett, 2005)  ◦ With the scandal hanging over Stewart's company, revenue in the first quarter of the year dropped 15 percent from the same period a year earlier.
                            ◦ The Senate Select Committee on Intelligence is preparing a blistering report on prewar intelligence on Iraq.
                            ◦ American intelligence leading up to the war on Iraq will be criticized by a powerful US Congressional committee due to report soon, officials said today.

Twitter                     ◦ Can Klay Thompson wake up
(This Work)                 ◦ Cmon Klay need u to get it going
                            ◦ Ezekiel Ansah wearing 3D glasses wout the lens
                            ◦ Wait Ezekiel ansah is wearing 3d movie glasses with the lenses knocked out
                            ◦ Marriage equality law passed in Rhode Island
                            ◦ Congrats to Rhode Island becoming the 10th state to enact marriage equality

Table 1: Representative examples from paraphrase corpora. The average sentence length is 11.9 words in Twitter vs. 18.6 in the news corpus.

Twitter data is very different, as shown in Table 1. We observe that among tweets posted around the same time about the same topic (e.g. a named entity), sentential paraphrases are short and can often be "anchored" by lexical paraphrases. This intuition leads to the at-least-one-anchor assumption we stated in the introduction.

The anchor could be a word the two sentences share in common. It could also be a pair of different words. For example, consider the word pair "next ‖ new" in two tweets comparing the new player Manti Te'o to the famous former American football player Junior Seau:

◦ Manti bout to be the next Junior Seau
◦ Teo is the little new Junior Seau

Further note that not every word pair of similar meaning indicates a sentence-level paraphrase. For example, the word "3", shared by two sentences about the movie "Iron Man" and referring to the third film in the series, is not a paraphrastic anchor:

◦ Iron Man 3 was brilliant fun
◦ Iron Man 3 tonight see what this is like

Therefore, we use a discriminative model at the word level to incorporate various features, such as part-of-speech features, to determine how likely a word pair is to be a paraphrase anchor.

2.2 Multi-instance Learning Paraphrase Model (MULTIP)

The at-least-one-anchor assumption naturally leads to a multi-instance learning problem (Dietterich et al., 1997), where the learner only observes labels on bags of instances (i.e. sentence-level paraphrases in this case) instead of labels on each individual instance (i.e. word pair).

We formally define an undirected graphical model of multi-instance learning for paraphrase identification, MULTIP. Figure 1 shows the proposed model in plate form and gives an example instantiation. The model has two layers, which allows joint reasoning between sentence-level and word-level components.

For each pair of sentences $s_i = (s_{i1}, s_{i2})$, there is an aggregate binary variable $y_i$ that represents whether they are paraphrases, and which is observed in the labeled training data. Let $W(s_{ik})$ be the set of words in the sentence $s_{ik}$, excluding the topic names. For each word pair $w_j = (w_{j1}, w_{j2}) \in W(s_{i1}) \times W(s_{i2})$, there exists a latent variable $z_j$ which denotes whether the word pair is a paraphrase anchor. In total there are $m = |W(s_{i1})| \times |W(s_{i2})|$ word pairs, and thus $\mathbf{z}_i = z_1, z_2, \ldots, z_j, \ldots, z_m$. Our at-least-one-anchor assumption is realized by a deterministic-or function; that is, if there exists at least one $j$ such that $z_j = 1$, then the sentence pair is a paraphrase.

Our conditional paraphrase identification model is defined as follows:

$$P(\mathbf{z}_i, y_i \mid \mathbf{w}_i; \theta) = \prod_{j=1}^{m} \phi(z_j, w_j; \theta) \times \sigma(\mathbf{z}_i, y_i) = \prod_{j=1}^{m} \exp\big(\theta \cdot f(z_j, w_j)\big) \times \sigma(\mathbf{z}_i, y_i) \quad (1)$$

where $f(z_j, w_j)$ is a vector of features extracted for the word pair $w_j$, $\theta$ is the parameter vector, and $\sigma$ is the factor that corresponds to the deterministic-or constraint:

$$\sigma(\mathbf{z}_i, y_i) = \begin{cases} 1 & \text{if } y_i = \text{true} \wedge \exists j : z_j = 1 \\ 1 & \text{if } y_i = \text{false} \wedge \forall j : z_j = 0 \\ 0 & \text{otherwise} \end{cases} \quad (2)$$
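To make the factorization concrete, the following is a minimal Python sketch of the per-word-pair log-linear factor and the deterministic-or aggregation. The feature dictionaries and helper names are illustrative placeholders, not the authors' released implementation.

```python
from typing import Dict, List

def word_pair_score(theta: Dict[str, float], feats: Dict[str, float], z: int) -> float:
    """Log-score of one word-pair factor: log phi(z_j, w_j; theta) = theta . f(z_j, w_j).
    Each feature is conjoined with the anchor decision z_j, as in a log-linear model."""
    return sum(theta.get(f"z={z}|{name}", 0.0) * value for name, value in feats.items())

def sentence_label(z: List[int]) -> bool:
    """Deterministic-or factor: the sentence pair is a paraphrase
    iff at least one word pair is an anchor (some z_j = 1)."""
    return any(z_j == 1 for z_j in z)

def most_likely_anchors(theta: Dict[str, float],
                        word_pair_feats: List[Dict[str, float]]) -> List[int]:
    """Independently pick the higher-scoring assignment for each z_j
    (the unconstrained decoding used for inference in Section 2.3)."""
    return [int(word_pair_score(theta, feats, 1) > word_pair_score(theta, feats, 0))
            for feats in word_pair_feats]
```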

2.3 Learning

To learn the parameters of the word-level paraphrase anchor classifier, $\theta$, we maximize the likelihood over the sentence-level annotations in our paraphrase corpus:

$$\theta^* = \arg\max_\theta P(\mathbf{y} \mid \mathbf{w}; \theta) = \arg\max_\theta \prod_i \sum_{\mathbf{z}_i} P(\mathbf{z}_i, y_i \mid \mathbf{w}_i; \theta) \quad (3)$$

An iterative gradient-ascent approach is used to estimate $\theta$ using perceptron-style additive updates (Collins, 2002; Liang et al., 2006; Zettlemoyer and Collins, 2007; Hoffmann et al., 2011). We define an update based on the gradient of the conditional log-likelihood using a Viterbi approximation, as follows:

$$\frac{\partial \log P(\mathbf{y} \mid \mathbf{w}; \theta)}{\partial \theta} = E_{P(\mathbf{z} \mid \mathbf{w}, \mathbf{y}; \theta)}\Big(\sum_i f(\mathbf{z}_i, \mathbf{w}_i)\Big) - E_{P(\mathbf{z}, \mathbf{y} \mid \mathbf{w}; \theta)}\Big(\sum_i f(\mathbf{z}_i, \mathbf{w}_i)\Big) \approx \sum_i f(\mathbf{z}^*_i, \mathbf{w}_i) - \sum_i f(\mathbf{z}'_i, \mathbf{w}_i) \quad (4)$$

where we define the feature sum for each sentence pair as $f(\mathbf{z}_i, \mathbf{w}_i) = \sum_j f(z_j, w_j)$ over all word pairs.

These two expectations are approximated by solving two simple inference problems as maximizations:

$$\mathbf{z}^* = \arg\max_{\mathbf{z}} P(\mathbf{z} \mid \mathbf{w}, \mathbf{y}; \theta) \qquad \mathbf{y}', \mathbf{z}' = \arg\max_{\mathbf{y}, \mathbf{z}} P(\mathbf{z}, \mathbf{y} \mid \mathbf{w}; \theta) \quad (5)$$

Input: a training set $\{(s_i, y_i) \mid i = 1 \ldots n\}$, where $i$ is an index corresponding to a particular sentence pair $s_i$, and $y_i$ is the training label.

 1: initialize parameter vector $\theta \leftarrow 0$
 2: for $i \leftarrow 1$ to $n$ do
 3:   extract all possible word pairs $\mathbf{w}_i = w_1, w_2, \ldots, w_m$ and their features from the sentence pair $s_i$
 4: end for
 5: for $l \leftarrow 1$ to maximum iterations do
 6:   for $i \leftarrow 1$ to $n$ do
 7:     $(y'_i, \mathbf{z}'_i) \leftarrow \arg\max_{y_i, \mathbf{z}_i} P(\mathbf{z}_i, y_i \mid \mathbf{w}_i; \theta)$
 8:     if $y'_i \neq y_i$ then
 9:       $\mathbf{z}^*_i \leftarrow \arg\max_{\mathbf{z}_i} P(\mathbf{z}_i \mid \mathbf{w}_i, y_i; \theta)$
10:       $\theta \leftarrow \theta + f(\mathbf{z}^*_i, \mathbf{w}_i) - f(\mathbf{z}'_i, \mathbf{w}_i)$
11:     end if
12:   end for
13: end for
14: return model parameters $\theta$

Figure 2: MULTIP Learning Algorithm

Computing both $\mathbf{z}'$ and $\mathbf{z}^*$ is rather straightforward under the structure of our model and can be done in time linear in the number of word pairs. The dependencies between $\mathbf{z}$ and $y$ are defined as deterministic-or factors $\sigma(\mathbf{z}_i, y_i)$, which, when satisfied, do not affect the overall probability of the solution. Each sentence pair is independent conditioned on the parameters. For $\mathbf{z}'$, it is sufficient to independently compute the most likely assignment $z'_j$ for each word pair, ignoring the deterministic dependencies; $y'_i$ is then set by aggregating all $z'_j$ through the deterministic-or operation. Similarly, we can find the exact solution for $\mathbf{z}^*$, the most likely assignment that respects the sentence-level training label $y_i$. For a positive training instance, we simply find its highest-scoring word pair $w_\tau$ according to the word-level classifier, then set $z^*_\tau = 1$ and $z^*_j = \arg\max_{x \in \{0,1\}} \phi(x, w_j; \theta)$ for all $j \neq \tau$; for a negative example, we set $\mathbf{z}^*_i = 0$. The time complexity of both inferences for one sentence pair is $O(|W(s)|^2)$, where $|W(s)|^2$ is the number of word pairs.

In practice, we use online learning instead of optimizing the full objective. The detailed learning algorithm is presented in Figure 2. Following Hoffmann et al. (2011), we use 50 iterations in the experiments.
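A rough Python rendering of the learning loop in Figure 2 might look like the sketch below. It reuses the helper functions from the earlier sketch (`word_pair_score`, `sentence_label`, `most_likely_anchors`), and the data structures and function names are illustrative assumptions rather than the authors' actual code.

```python
from collections import defaultdict

def feature_sum(assignment, word_pair_feats):
    """f(z_i, w_i) = sum over word pairs of f(z_j, w_j), with features conjoined with z_j."""
    total = defaultdict(float)
    for z_j, feats in zip(assignment, word_pair_feats):
        for name, value in feats.items():
            total[f"z={z_j}|{name}"] += value
    return total

def train_multip(dataset, iterations=50):
    """dataset: list of (word_pair_feats, y), where word_pair_feats is a list of feature
    dicts for all word pairs of one sentence pair and y is the sentence-level label."""
    theta = defaultdict(float)
    for _ in range(iterations):
        for word_pair_feats, y in dataset:
            z_pred = most_likely_anchors(theta, word_pair_feats)   # unconstrained decode (z')
            y_pred = sentence_label(z_pred)
            if y_pred != y:
                # Constrained decode (z*): force agreement with the gold sentence label.
                if y:
                    # Positive pair: keep the free decisions but switch on the
                    # best-scoring word pair as the anchor.
                    best = max(range(len(word_pair_feats)),
                               key=lambda j: word_pair_score(theta, word_pair_feats[j], 1))
                    z_gold = list(z_pred)
                    z_gold[best] = 1
                else:
                    # Negative pair: no anchors allowed.
                    z_gold = [0] * len(word_pair_feats)
                # Perceptron-style additive update: theta += f(z*, w) - f(z', w)
                for name, value in feature_sum(z_gold, word_pair_feats).items():
                    theta[name] += value
                for name, value in feature_sum(z_pred, word_pair_feats).items():
                    theta[name] -= value
    return theta
```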


2.4 Feature Design

At the word level, our discriminative model allows the use of arbitrary features that are similar to those in monolingual word alignment models (MacCartney et al., 2008; Thadani and McKeown, 2011; Yao et al., 2013a,b). But unlike discriminative monolingual word alignment, we only use sentence-level training labels instead of word-level alignment annotation. For every word pair, we extract the following features:

String Features that indicate whether the two words, their stemmed forms and their normalized forms are the same, similar or dissimilar. We used the Morpha stemmer (Minnen et al., 2001),³ Jaro-Winkler string similarity (Winkler, 1999) and the Twitter normalization lexicon by Han et al. (2012).

POS Features that are based on the part-of-speech tags of the two words in the pair, specifying whether the two words have the same or different POS tags and what the specific tags are. We use the Twitter part-of-speech tagger developed by Derczynski et al. (2013). We add new fine-grained tags for variations of eight words: "a", "be", "do", "have", "get", "go", "follow" and "please". For example, we use a tag HA for the words "have", "has" and "had".

Topical Features that relate to the strength of a word's association with the topic. These features identify the popular words in each topic, e.g. "3" in tweets about a basketball game, or "RIP" in tweets about a celebrity's death. We use the G² log-likelihood-ratio statistic, which has been frequently used in NLP, as a measure of word association (Dunning, 1993; Moore, 2004). The significance scores are computed for each trend over an average of about 1,500 sentences and converted to binary features for every word pair, indicating whether the two words are both significant or not.

Our topical features are novel and were not used in previous work. Following Riedel et al. (2010) and Hoffmann et al. (2011), we also incorporate conjunction features into our system for better accuracy, namely Word+POS, Word+Topical and Word+POS+Topical features.
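As a rough illustration of how these feature groups could be assembled for one word pair, consider the sketch below. The inputs (POS tags, stemming, string similarity, normalization match, topical significance) are assumed to be precomputed with the resources cited above; the function and parameter names are hypothetical.

```python
def extract_features(w1, w2, pos1, pos2, topical1, topical2,
                     same_stem, string_sim, norm_match):
    """Return a binary feature dict for one word pair.
    w1/w2: surface forms; pos1/pos2: POS tags; topical1/topical2: whether each word is
    significantly associated with the topic (G^2 test); same_stem, string_sim and
    norm_match are assumed to come from a stemmer, Jaro-Winkler similarity and a
    normalization lexicon, respectively."""
    feats = {}
    # String features
    feats[f"identical={w1.lower() == w2.lower()}"] = 1.0
    feats[f"same_stem={same_stem}"] = 1.0
    feats[f"string_sim_high={string_sim > 0.9}"] = 1.0
    feats[f"norm_match={norm_match}"] = 1.0
    # POS features
    feats[f"pos_pair={pos1}_{pos2}"] = 1.0
    feats[f"same_pos={pos1 == pos2}"] = 1.0
    # Topical features
    feats[f"both_topical={topical1 and topical2}"] = 1.0
    # Conjunction features (Word+POS, Word+Topical, Word+POS+Topical)
    feats[f"word_pos={w1.lower()}|{w2.lower()}|{pos1}_{pos2}"] = 1.0
    feats[f"word_topical={w1.lower()}|{w2.lower()}|{topical1 and topical2}"] = 1.0
    feats[f"word_pos_topical={w1.lower()}|{w2.lower()}|{pos1}_{pos2}|{topical1 and topical2}"] = 1.0
    return feats
```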

³ https://github.com/knowitall/morpha

3 Experiments

3.1 Data

It is nontrivial to gather a gold-standard dataset of naturally occurring paraphrases and non-paraphrases efficiently from Twitter, since this requires pairwise comparison of tweets and faces a very large search space. To make this annotation task tractable, we design a novel and efficient crowdsourcing method using Amazon Mechanical Turk. Our entire data collection process is detailed in Section 4, with several experiments that demonstrate annotation quality and efficiency.

In total, we constructed a Twitter Paraphrase Corpus of 18,762 sentence pairs and 19,946 unique sentences. The training and development set consists of 17,790 sentence pairs posted between April 24th and May 3rd, 2014, from 500+ trending topics (excluding hashtags). Our paraphrase model and data collection approach are general and can be combined with various Twitter topic detection solutions (Diao et al., 2012; Ritter et al., 2012). As a demonstration, we use Twitter's own trends service since it is easily available. Twitter trending topics are determined by an unpublished algorithm which finds words, phrases and hashtags that have had a sharp increase in popularity, as opposed to overall volume. We use case-insensitive exact matching to locate topic names in the sentences.

Each sentence pair was annotated by 5 different crowdsourcing workers. For the test set, we obtained both crowdsourced and expert labels on 972 sentence pairs from 20 randomly sampled Twitter trending topics between May 13th and June 10th. Our dataset is more realistic and balanced, containing 79% non-paraphrases vs. 34% in the benchmark Microsoft Paraphrase Corpus of news data. As noted in (Das and Smith, 2009), the lack of natural non-paraphrases in the MSR corpus creates bias towards certain models.

3.2 Baselines

We use four baselines to compare with our proposed approach for the sentential paraphrase identification task. The first baseline is a supervised logistic regression (LR) model used by Das and Smith (2009). It uses simple n-gram (also in stemmed form) overlapping features but shows very competitive performance on the MSR corpus.


Method                                F1     Precision  Recall
Random                                0.294  0.208      0.500
WTMF (Guo and Diab, 2012)*            0.583  0.525      0.655
LR (Das and Smith, 2009)**            0.630  0.629      0.632
LEXLATENT                             0.641  0.663      0.621
LEXDISCRIM (Ji and Eisenstein, 2013)  0.645  0.664      0.628
MULTIP                                0.724  0.722      0.726
Human Upperbound                      0.823  0.752      0.908

Table 2: Performance of different paraphrase identification approaches on Twitter data. * An enhanced version that uses an additional 1.6 million sentences from Twitter. ** Reimplementation of a strong baseline used by Das and Smith (2009).

The second baseline is a state-of-the-art unsupervised method, Weighted Textual Matrix Factorization (WTMF),⁴ which is specially developed for short sentences by modeling the semantic space of both words that are present in and absent from the sentences (Guo and Diab, 2012). The original model was learned from WordNet (Fellbaum, 2010), OntoNotes (Hovy et al., 2006), Wiktionary, and the Brown corpus (Francis and Kucera, 1979). We enhance the model with 1.6 million sentences from Twitter as suggested by Guo et al. (2013).

Ji and Eisenstein (2013) presented a state-of-the-art ensemble system, which we call LEXDISCRIM.⁵ It directly combines both discriminatively-tuned latent features and surface lexical features in an SVM classifier. Specifically, the latent representation of a pair of sentences $\vec{v}_1$ and $\vec{v}_2$ is converted into a feature vector $[\vec{v}_1 + \vec{v}_2, |\vec{v}_1 - \vec{v}_2|]$ by concatenating the element-wise sum $\vec{v}_1 + \vec{v}_2$ and absolute difference $|\vec{v}_1 - \vec{v}_2|$.

We also introduce a new baseline, LEXLATENT, which is a simplified version of LEXDISCRIM that is easy to reproduce. It uses the same method to combine latent and surface features, but instead combines the open-source WTMF latent space model and the logistic regression model from above. It achieves performance similar to LEXDISCRIM on our dataset (Table 2).
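The latent-feature combination used by LEXDISCRIM and LEXLATENT is simple to reproduce once sentence vectors are available. The sketch below assumes the latent vectors come from WTMF or a similar model and that surface lexical features have already been computed; it only shows the concatenation step.

```python
import numpy as np

def pair_features(v1: np.ndarray, v2: np.ndarray, surface: np.ndarray) -> np.ndarray:
    """Concatenate the element-wise sum and absolute difference of the two sentence
    vectors, plus any surface lexical features, to form the classifier input."""
    return np.concatenate([v1 + v2, np.abs(v1 - v2), surface])
```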

⁴ The source code and data for WTMF are available at: http://www.cs.columbia.edu/~weiwei/code.html

⁵ The parsing feature was removed because it was not helpful on our Twitter dataset.

3.3 System Performance

For the evaluation of the different systems, we compute precision-recall curves and report the highest F1 measure of any point on the curve, on the test dataset of 972 sentence pairs against the expert labels. Table 2 shows the performance of the different systems. Our proposed MULTIP, a principled latent variable model alone, achieves results competitive with the state-of-the-art system that combines discriminative training and latent semantics.
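The maximum F1 over a precision-recall curve can be computed by sweeping a decision threshold over the ranked model scores; a small sketch under the assumption that `scores` and binary `labels` are aligned lists for the test pairs:

```python
def max_f1(scores, labels):
    """Return the highest F1 obtainable at any threshold over the ranked test pairs."""
    if sum(labels) == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    tp, best = 0, 0.0
    for rank, i in enumerate(order, start=1):
        tp += labels[i]
        precision, recall = tp / rank, tp / total_pos
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```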

In Table 2, we also show the agreement level of labels derived from 5 non-expert annotations on Mechanical Turk, which can be considered an upper bound for the automatic paraphrase recognition task performed on this dataset. The annotation quality of our corpus is surprisingly good given that the definition of paraphrase is rather inexact (Bhagat and Hovy, 2013); the inter-rater agreement between expert annotators on news data is only 0.83, as reported by Dolan et al. (2004).

                     F1     Prec   Recall
MULTIP               0.724  0.722  0.726
 - String features   0.509  0.448  0.589
 - POS features      0.496  0.350  0.851
 - Topical features  0.715  0.694  0.737

Table 3: Feature ablation by removing each individual feature group from the full set.

To assess the impact of different features on the model's performance, we conduct feature ablation experiments, removing one group of features at a time. The results are shown in Table 3. Both string and POS features are essential for system performance, while topical features are helpful but not as crucial.


Para?  Sentence Pair from Twitter                                                MULTIP    LEXLATENT
YES    ◦ The new Ciroc flavor has arrived                                        rank=12   rank=266
       ◦ Ciroc got a new flavor comin out
YES    ◦ Roberto Mancini gets the boot from Man City                             rank=64   rank=452
       ◦ Roberto Mancini has been sacked by Manchester City with the Blues saying
YES    ◦ I want to watch the purge tonight                                       rank=136  rank=11
       ◦ I want to go see The Purge who wants to come with
NO     ◦ Somebody took the Marlins to 20 innings                                 rank=8    rank=54
       ◦ Anyone who stayed 20 innings for the marlins
NO     ◦ WORLD OF JENKS IS ON AT 11                                              rank=167  rank=9
       ◦ World of Jenks is my favorite show on tv

Table 4: Example system outputs; rank is the position in the list of all candidate paraphrase pairs in the test set ordered by model score. MULTIP discovers lexically divergent paraphrases while LEXLATENT prefers more overall sentence similarity. Underlining marks the word pair(s) with the highest estimated probability as paraphrastic anchor(s) for each sentence pair.

[Figure 3 shows two precision-recall plots: the left panel compares MULTIP with LEXLATENT, and the right panel compares MULTIP-PE with LEXLATENT.]

Figure 3: Precision and recall curves. Our MULTIP model alone achieves competitive performance with the LEXLATENT system that combines a latent space model and a feature-based supervised classifier. The two approaches have complementary strengths, and achieve a significant improvement when combined (MULTIP-PE).

Figure 3 presents precision-recall curves and shows the sensitivity and specificity of each model in comparison. In the first half of the curve (recall < 0.5), the MULTIP model makes bolder and less accurate decisions than LEXLATENT. However, the curve for the MULTIP model is flatter and shows consistently better precision in the second half (recall > 0.5), as well as a higher maximum F1 score. This result reflects the design concept of MULTIP, which is intended to aggressively pick up sentential paraphrases with more divergent wordings. LEXLATENT, as a combined system, considers sentence features in both surface and latent space and is more conservative. Table 4 further illustrates this difference with some example system outputs.


3.4 Product of Experts (MULTIP-PE)

Our MULTIP model and previous similarity-based approaches have complementary strengths, so we experiment with combining MULTIP ($P_m$) and LEXLATENT ($P_l$) through a product of experts (Hinton, 2002):

$$P(y \mid s_1, s_2) = \frac{P_m(y \mid s_1, s_2) \times P_l(y \mid s_1, s_2)}{\sum_{y} P_m(y \mid s_1, s_2) \times P_l(y \mid s_1, s_2)} \quad (6)$$

The resulting system, MULTIP-PE, provides consistently better precision and recall than the LEXLATENT model, as shown on the right in Figure 3. The MULTIP-PE system outperforms LEXLATENT significantly according to a paired t-test with p less than 0.05. Our proposed MULTIP takes advantage of Twitter's specific properties and provides complementary information to previous approaches. Previously, Das and Smith (2009) also used a product of experts to combine a lexical and a syntax-based model.
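For the binary paraphrase label, Equation 6 reduces to renormalizing the product of the two models' probabilities; a minimal sketch, assuming each expert outputs the probability that the pair is a paraphrase:

```python
def product_of_experts(p_multip: float, p_lexlatent: float) -> float:
    """Combine the two experts' probabilities that y = paraphrase (Equation 6)."""
    pos = p_multip * p_lexlatent
    neg = (1.0 - p_multip) * (1.0 - p_lexlatent)
    return pos / (pos + neg)
```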

4 Constructing Twitter Paraphrase Corpus

We now turn to describing our data collection and annotation methodology. Our goal is to construct a high-quality dataset that contains representative examples of paraphrases and non-paraphrases in Twitter. Since Twitter users are free to talk about anything regarding any topic, a random pair of sentences about the same topic has a low chance (less than 8%) of expressing the same meaning. This causes two problems: a) it is expensive to obtain paraphrases via manual annotation; b) non-expert annotators tend to loosen the criteria and are more likely to make false positive errors. To address these challenges, we design a simple annotation task and introduce two selection mechanisms to select sentences which are more likely to be paraphrases, while preserving diversity and representativeness.

4.1 Raw Data from Twitter

We crawl Twitter's trending topics and their associated tweets using public APIs.⁶ According to Twitter, trends are determined by an algorithm which identifies topics that are immediately popular, rather than those that have been popular for longer periods of time or which trend on a daily basis. We tokenize and split each tweet into sentences.⁷

⁶ More information about Twitter's APIs: https://dev.twitter.com/docs/api/1.1/overview

[Figure 4 is a heat-map over expert scores (0-5) and crowdsourcing votes (0-5); only the caption is reproduced here.]

Figure 4: A heat-map showing overlap between expert and crowdsourcing annotation. The intensity along the diagonal indicates good reliability of crowdsourcing workers for this particular task, and the shift above the diagonal reflects the difference between the two annotation schemas. For crowdsourcing (turk), the numbers indicate how many annotators out of 5 picked the sentence pair as paraphrases; 0 and 1 are considered non-paraphrases, while 3, 4 and 5 are paraphrases. For expert annotation, 0, 1 and 2 are non-paraphrases; 4 and 5 are paraphrases. Medium-scored cases are discarded in training and testing in our experiments.

4.2 Task Design on Mechanical Turk

We show the annotator an original sentence, then ask them to pick sentences with the same meaning from 10 candidate sentences. The original and candidate sentences are randomly sampled from the same topic. For each such 1 vs. 10 question, we obtain binary judgements from 5 different annotators, paying each annotator $0.02 per question. On average, each question takes one annotator about 30 to 45 seconds to answer.

4.3 Annotation Quality

We remove problematic annotators by checking their Cohen's Kappa agreement (Artstein and Poesio, 2008) with other annotators.

⁷ We use the toolkit developed by O'Connor et al. (2010): https://github.com/brendano/tweetmotif


[Figure 5 is a bar chart over 20 trending topics (e.g. Reggie Miller, The Clippers, Dwight Howard, Facebook, U.S.) comparing the percentage of positive judgements under random and filtered sampling; only the caption is reproduced here.]

Figure 5: The proportion of paraphrases (percentage of positive votes from annotators) varies greatly across different topics. Automatic filtering in Section 4.4 roughly doubles the paraphrase yield.

We also compute inter-annotator agreement with an expert annotator on 971 sentence pairs. In the expert annotation, we adopt a 5-point Likert scale to measure the degree of semantic similarity between sentences, which is defined by Agirre et al. (2012) as follows:

5: Completely equivalent, as they mean the same thing;
4: Mostly equivalent, but some unimportant details differ;
3: Roughly equivalent, but some important information differs/is missing;
2: Not equivalent, but share some details;
1: Not equivalent, but are on the same topic;
0: On different topics.

Although the two scales of expert and crowdsourcing annotation are defined differently, their Pearson correlation coefficient reaches 0.735 (two-tailed significance 0.001). Figure 4 shows a heat-map representing the detailed overlap between the two annotations. It suggests that the graded similarity annotation task could be reduced to a binary choice in a crowdsourcing setup.

4.4 Automatic Summarization Inspired Sentence Filtering

We filter the sentences within each topic to select more probable paraphrases for annotation. Our method is inspired by a typical problem in extractive summarization: salient sentences are likely to be redundant (paraphrases) and need to be removed from the output summaries. We employ the scoring method used in SumBasic (Nenkova and Vanderwende, 2005; Vanderwende et al., 2007), a simple but powerful summarization system, to find salient sentences. For each topic, we compute the probability of each word $P(w_i)$ by simply dividing its frequency by the total number of words in all sentences. Each sentence $s$ is scored as the average of the probabilities of the words in it, i.e.

$$\mathrm{Salience}(s) = \frac{\sum_{w_i \in s} P(w_i)}{|\{w_i \mid w_i \in s\}|} \quad (7)$$

We then rank the sentences and pick the original sentence randomly from the top 10% most salient sentences, and the candidate sentences from the top 50%, to present to the annotators.
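A minimal sketch of this SumBasic-style scoring and the two sampling pools is shown below; the whitespace tokenization and the per-topic grouping of sentences are simplifying assumptions, and the helper names are illustrative.

```python
import random
from collections import Counter

def salience_scores(sentences):
    """Score each sentence as the average unigram probability of its words (Equation 7)."""
    tokenized = [s.lower().split() for s in sentences]
    counts = Counter(w for toks in tokenized for w in toks)
    total = sum(counts.values())
    return [sum(counts[w] / total for w in toks) / len(toks) for toks in tokenized]

def pick_for_annotation(sentences, n_candidates=10):
    """Pick the 'original' sentence from the top 10% most salient sentences and the
    candidate sentences from the top 50%, as described above."""
    ranked = [s for _, s in sorted(zip(salience_scores(sentences), sentences), reverse=True)]
    original = random.choice(ranked[:max(1, len(ranked) // 10)])
    pool = [s for s in ranked[:max(1, len(ranked) // 2)] if s != original]
    return original, random.sample(pool, min(n_candidates, len(pool)))
```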

In a trial experiment on 20 topics, the filtering technique doubled the yield of paraphrases, from 152 to 329 out of 2,000 sentence pairs, compared to naive random sampling (Figure 5 and Figure 6). We also use PINC (Chen and Dolan, 2011) to measure the quality of the paraphrases collected (Figure 7). PINC was designed to measure n-gram dissimilarity between two sentences, and in essence it is the inverse of BLEU. In general, the cases with high PINC scores include more complex and interesting rephrasings.
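PINC itself is straightforward to compute; the sketch below follows the n-gram formulation of Chen and Dolan (2011), assuming simple whitespace tokenization:

```python
def pinc(source: str, candidate: str, max_n: int = 4) -> float:
    """PINC: average, over n-gram orders, of the fraction of candidate n-grams
    NOT appearing in the source sentence (higher means more rewording)."""
    src, cand = source.lower().split(), candidate.lower().split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = {tuple(cand[i:i + n]) for i in range(len(cand) - n + 1)}
        if not cand_ngrams:
            continue
        src_ngrams = {tuple(src[i:i + n]) for i in range(len(src) - n + 1)}
        overlap = len(cand_ngrams & src_ngrams) / len(cand_ngrams)
        scores.append(1.0 - overlap)
    return sum(scores) / len(scores) if scores else 0.0
```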


[Figure 6 is a bar chart of the number of sentence pairs (out of 2,000) receiving each number of positive judgements, for random, filtered and MAB-based collection; only the caption is reproduced here.]

Figure 6: Numbers of paraphrases collected by different methods. The annotation efficiency (3, 4 and 5 positive votes are regarded as paraphrases) is significantly improved by the sentence filtering and the Multi-Armed Bandits (MAB) based topic selection.

[Figure 7 plots PINC scores (roughly 0.75 to 1.00) against the number of positive judgements, for random, filtered and MAB-based collection; only the caption is reproduced here.]

Figure 7: PINC scores of paraphrases collected. The higher the PINC, the more significant the rewording. Our proposed annotation strategy quadruples the paraphrase yield while not greatly reducing diversity as measured by PINC.

4.5 Topic Selection using Multi-Armed Bandits (MAB) Algorithm

Another approach to increasing paraphrase yield is to choose more appropriate topics. This is particularly important because the number of paraphrases varies greatly from topic to topic, and so does the chance of encountering paraphrases during annotation (Figure 5). We treat this topic selection problem as a variation of the Multi-Armed Bandit (MAB) problem (Robbins, 1985) and adapt a greedy algorithm, the bounded ε-first algorithm of Tran-Thanh et al. (2012), to accelerate our corpus construction.

Our strategy consists of two phases. In the first, exploration phase, we dedicate a fraction $\varepsilon$ of the total budget $B$ to explore randomly chosen arms of each slot machine (trending topic on Twitter), each $m$ times. In the second, exploitation phase, we sort all topics according to their estimated proportion of paraphrases, and sequentially annotate the $\lceil \frac{(1-\varepsilon)B}{l-m} \rceil$ arms that have the highest estimated reward, until reaching the maximum of $l = 10$ annotations for any topic, to ensure data diversity.

We tune the parameter $m$ to 1 and $\varepsilon$ to between 0.35 and 0.55 through simulation experiments, by artificially duplicating a small amount of real annotation data. We then apply this MAB algorithm in the real world, exploring 500 random topics and then exploiting 100 of them. The yield of paraphrases rises to 688 out of 2,000 sentence pairs when using MAB together with sentence filtering, a 4-fold increase compared to using random selection alone (Figure 6).
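The bounded ε-first strategy can be sketched as follows. Budget accounting is simplified, the reward estimate is just the observed paraphrase rate, and `annotate` is a hypothetical callback that runs one annotation batch for a topic and returns the number of paraphrases found; parameter names follow the description above.

```python
import math
import random

def bounded_epsilon_first(topics, annotate, budget, epsilon=0.45, m=1, l=10):
    """topics: list of topic ids; annotate(topic): run one annotation batch and return
    the number of paraphrases found; returns the total paraphrase yield."""
    yield_total, spent = 0, 0
    estimates = {}
    # Exploration phase: spend roughly epsilon * budget, annotating random topics m times each.
    random.shuffle(topics)
    explored = []
    for topic in topics:
        if spent + m > epsilon * budget:
            break
        reward = sum(annotate(topic) for _ in range(m))
        estimates[topic] = reward / m
        explored.append(topic)
        spent += m
        yield_total += reward
    # Exploitation phase: annotate the ceil((1 - epsilon) * B / (l - m)) best topics,
    # up to l annotations per topic, until the budget runs out.
    n_arms = math.ceil((1 - epsilon) * budget / (l - m))
    for topic in sorted(explored, key=estimates.get, reverse=True)[:n_arms]:
        for _ in range(l - m):
            if spent >= budget:
                return yield_total
            yield_total += annotate(topic)
            spent += 1
    return yield_total
```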

5 Related Work

Automatic Paraphrase Identification has been widely studied (Androutsopoulos and Malakasiotis, 2010; Madnani and Dorr, 2010). The ACL Wiki gives an excellent summary of various techniques.⁸

Many recent high-performance approaches use system combination (Das and Smith, 2009; Madnani et al., 2012; Ji and Eisenstein, 2013). For example, Madnani et al. (2012) combine multiple sophisticated machine translation metrics using a meta-classifier. An earlier attempt on Twitter data is that of Xu et al. (2013b). They limited the search space to only those tweets that explicitly mention the same date and the same named entity; however, there remains a considerable amount of mislabeled pairs in their data.⁹ Zanzotto et al. (2011) also experimented with SVM tree kernel methods on Twitter data.

Departing from the previous work, we propose a latent variable model to jointly infer the correspondence between words and sentences. It is related to discriminative monolingual word alignment (MacCartney et al., 2008; Thadani and McKeown, 2011; Yao et al., 2013a,b), but different in that the paraphrase task requires additional sentence alignment modeling with no word alignment data. Our approach is also inspired by Fung and Cheung's (2004a; 2004b) work on bootstrapping bilingual parallel sentence and word translations from comparable corpora.

⁸ http://aclweb.org/aclwiki/index.php?title=Paraphrase_Identification_(State_of_the_art)

⁹ The data is released by Xu et al. (2013b) at: https://github.com/cocoxu/twitterparaphrase/

Multiple Instance Learning (Dietterich et al., 1997) has been used by different research groups in the field of information extraction (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012; Ritter et al., 2013; Xu et al., 2013a). The idea is to leverage structured data as weak supervision for tasks such as relation extraction. This is done, for example, by making the assumption that at least one sentence in the corpus which mentions a pair of entities (e1, e2) participating in a relation (r) expresses the proposition r(e1, e2).

Crowdsourcing Paraphrase Acquisition: Buzek et al. (2010) and Denkowski et al. (2010) focused specifically on collecting paraphrases of text to be translated in order to improve machine translation quality. Chen and Dolan (2011) gathered a large-scale paraphrase corpus by asking Mechanical Turk workers to caption the action in short video segments. Similarly, Burrows et al. (2012) asked crowdsourcing workers to rewrite selected excerpts from books. Ling et al. (2014) crowdsourced bilingual parallel text using Twitter as the source of data.

In contrast, we design a simple crowdsourcing task requiring only binary judgements on sentences collected from Twitter. There are several advantages compared to existing work: a) the corpus covers a very diverse range of topics and linguistic expressions, especially colloquial language, which is different from and thus complements previous paraphrase corpora; b) the collected paraphrase corpus contains a representative proportion of both negative and positive instances, while a lack of good negative examples was an issue in previous research (Das and Smith, 2009); c) this method is scalable and sustainable due to the simplicity of the task and the real-time, virtually unlimited text supply from Twitter.

6 Conclusions

This paper introduced MULTIP, a joint word-sentence model to learn paraphrases from temporally and topically grouped messages on Twitter. While simple and principled, our model achieves performance competitive with a state-of-the-art ensemble system combining latent semantic representations and surface similarity. By combining our method with previous work as a product of experts, we outperform the state-of-the-art. Our latent-variable approach is capable of learning word-level paraphrase anchors given only sentence annotations. Because our graphical model is modular and extensible (for example, it should be possible to replace the deterministic-or with other aggregators), we are optimistic this work might provide a path towards weakly supervised word alignment models using only sentence-level annotations.

In addition, we presented a novel and efficient annotation methodology which was used to crowdsource a unique corpus of paraphrases harvested from Twitter. We make this resource available to the research community.

Acknowledgments

The authors would like to thank editor Sharon Goldwater and the three anonymous reviewers for their thoughtful comments, which substantially improved this paper. We also thank Ralph Grishman, Sameer Singh, Yoav Artzi, Mark Yatskar, Chris Quirk, Ani Nenkova and Mitch Marcus for their feedback.

This material is based in part on research sponsored by the NSF under grant IIS-1430651, by DARPA under agreement number FA8750-13-2-0017 (the DEFT program), and through a Google Faculty Research Award to Chris Callison-Burch. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA or the U.S. Government. Yangfeng Ji is supported by a Google Faculty Research Award awarded to Jacob Eisenstein.


References

Agirre, E., Diab, M., Cer, D., and Gonzalez-Agirre, A. (2012). SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM).

Androutsopoulos, I. and Malakasiotis, P. (2010). A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38.

Artstein, R. and Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4).

Bhagat, R. and Hovy, E. (2013). What is a paraphrase? Computational Linguistics, 39(3).

Burrows, S., Potthast, M., and Stein, B. (2012). Paraphrase acquisition via crowdsourcing and machine learning. Transactions on Intelligent Systems and Technology (ACM TIST).

Buzek, O., Resnik, P., and Bederson, B. B. (2010). Error driven paraphrase annotation using Mechanical Turk. In Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.

Chen, D. L. and Dolan, W. B. (2011). Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL).

Collins, M. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the Conference on Empirical Methods on Natural Language Processing (EMNLP).

Das, D. and Smith, N. A. (2009). Paraphrase identification as probabilistic quasi-synchronous recognition. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP).

Denkowski, M., Al-Haj, H., and Lavie, A. (2010). Turker-assisted paraphrasing for English-Arabic machine translation. In Proceedings of the Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.

Derczynski, L., Ritter, A., Clark, S., and Bontcheva, K. (2013). Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In Proceedings of Recent Advances in Natural Language Processing (RANLP).

Diao, Q., Jiang, J., Zhu, F., and Lim, E.-P. (2012). Finding bursty topics from microblogs. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL).

Dietterich, T. G., Lathrop, R. H., and Lozano-Perez, T. (1997). Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1).

Dolan, B., Quirk, C., and Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics (COLING).

Dolan, W. and Brockett, C. (2005). Automatically constructing a corpus of sentential paraphrases. In Proceedings of the 3rd International Workshop on Paraphrasing.

Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1).

Fellbaum, C. (2010). WordNet. In Theory and Applications of Ontology: Computer Applications. Springer.

Francis, W. N. and Kucera, H. (1979). Brown corpus manual. Brown University.

Fung, P. and Cheung, P. (2004a). Mining very-non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Fung, P. and Cheung, P. (2004b). Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus. In Proceedings of the International Conference on Computational Linguistics (COLING).

Guo, W. and Diab, M. (2012). Modeling sentences in the latent space. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL).

Guo, W., Li, H., Ji, H., and Diab, M. (2013). Linking tweets to news: A framework to enrich short text data in social media. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL).

Han, B., Cook, P., and Baldwin, T. (2012). Automatically constructing a normalisation dictionary for microblogs. In Proceedings of the Conference on Empirical Methods on Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8).

Hoffmann, R., Zhang, C., Ling, X., Zettlemoyer, L. S., and Weld, D. S. (2011). Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL).

Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., and Weischedel, R. (2006). OntoNotes: The 90% solution. In Proceedings of the Human Language Technology Conference - North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT-NAACL).

Ji, Y. and Eisenstein, J. (2013). Discriminative improvements to distributional sentence similarity. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Liang, P., Bouchard-Cote, A., Klein, D., and Taskar, B. (2006). An end-to-end discriminative approach to machine translation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL).

Ling, W., Marujo, L., Dyer, C., Black, A. W., and Trancoso, I. (2014). Crowdsourcing high-quality parallel data extraction from Twitter. In Proceedings of the Ninth Workshop on Statistical Machine Translation (WMT).

MacCartney, B., Galley, M., and Manning, C. (2008). A phrase-based alignment model for natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Madnani, N. and Dorr, B. J. (2010). Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36(3).

Madnani, N., Tetreault, J., and Chodorow, M. (2012). Re-examining machine translation metrics for paraphrase identification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT).

Minnen, G., Carroll, J., and Pearce, D. (2001). Applied morphological processing of English. Natural Language Engineering, 7(3).

Moore, R. C. (2004). On log-likelihood-ratios and the significance of rare events. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Nenkova, A. and Vanderwende, L. (2005). The impact of frequency on summarization. Technical Report MSR-TR-2005-101, Microsoft Research.

O'Connor, B., Krieger, M., and Ahn, D. (2010). TweetMotif: Exploratory search and topic summarization for Twitter. In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media (ICWSM).

Petrovic, S., Osborne, M., and Lavrenko, V. (2012). Using paraphrases for improving first story detection in news and Twitter. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT).

Riedel, S., Yao, L., and McCallum, A. (2010). Modeling relations and their mentions without labeled text. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD).

Ritter, A., Mausam, Etzioni, O., and Clark, S. (2012). Open domain event extraction from Twitter. In Proceedings of the 18th International Conference on Knowledge Discovery and Data Mining (SIGKDD).

Ritter, A., Zettlemoyer, L., Mausam, and Etzioni, O. (2013). Modeling missing data in distant supervision for information extraction. Transactions of the Association for Computational Linguistics (TACL).

Robbins, H. (1985). Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers. Springer.

Sekine, S. (2005). Automatic paraphrase discovery based on context and keywords between NE pairs. In Proceedings of the 3rd International Workshop on Paraphrasing.

Shinyama, Y., Sekine, S., and Sudo, K. (2002). Automatic paraphrase acquisition from news articles. In Proceedings of the 2nd International Conference on Human Language Technology Research (HLT).

Surdeanu, M., Tibshirani, J., Nallapati, R., and Manning, C. D. (2012). Multi-instance multi-label learning for relation extraction. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL).

Thadani, K. and McKeown, K. (2011). Optimal and syntactically-informed decoding for monolingual phrase-based alignment. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics - Human Language Technologies (ACL-HLT).

Tran-Thanh, L., Stein, S., Rogers, A., and Jennings, N. R. (2012). Efficient crowdsourcing of unknown experts using multi-armed bandits. In Proceedings of the European Conference on Artificial Intelligence (ECAI).

Vanderwende, L., Suzuki, H., Brockett, C., and Nenkova, A. (2007). Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion. Information Processing & Management, 43.

Wan, S., Dras, M., Dale, R., and Paris, C. (2006). Using dependency-based features to take the "para-farce" out of paraphrase. In Proceedings of the Australasian Language Technology Workshop.

Wang, L., Dyer, C., Black, A. W., and Trancoso, I. (2013). Paraphrasing 4 microblog normalization. In Proceedings of the Conference on Empirical Methods on Natural Language Processing (EMNLP).

Winkler, W. E. (1999). The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Census Bureau.

Xu, W., Hoffmann, R., Zhao, L., and Grishman, R. (2013a). Filling knowledge base gaps for distant supervision of relation extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL).

Xu, W., Ritter, A., and Grishman, R. (2013b). Gathering and generating paraphrases from Twitter with application to normalization. In Proceedings of the Sixth Workshop on Building and Using Comparable Corpora (BUCC).

Yao, X., Van Durme, B., Callison-Burch, C., and Clark, P. (2013a). A lightweight and high performance monolingual word aligner. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL).

Yao, X., Van Durme, B., and Clark, P. (2013b). Semi-Markov phrase-based monolingual alignment. In Proceedings of the Conference on Empirical Methods on Natural Language Processing (EMNLP).

Zanzotto, F. M., Pennacchiotti, M., and Tsioutsiouliklis, K. (2011). Linguistic redundancy in Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Zettlemoyer, L. S. and Collins, M. (2007). Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Zhang, C. and Weld, D. S. (2013). Harvesting parallel news streams to generate paraphrases of event relations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
