Harvesting Parallel News Streams to Generate Paraphrases of Event Relations

Congle Zhang, Daniel S. Weld

Computer Science & Engineering

University of Washington

Seattle, WA 98195, USA

{clzhang,weld}@cs.washington.edu

Abstract

The distributional hypothesis, which states that words that occur in similar contexts tend to have similar meanings, has inspired several Web mining algorithms for paraphrasing semantically equivalent phrases. Unfortunately, these methods have several drawbacks, such as confusing synonyms with antonyms and causes with effects. This paper introduces three Temporal Correspondence Heuristics that characterize regularities in parallel news streams, and shows how they may be used to generate high precision paraphrases for event relations. We encode the heuristics in a probabilistic graphical model to create the NEWSSPIKE algorithm for mining news streams. We present experiments demonstrating that NEWSSPIKE significantly outperforms several competitive baselines. In order to spur further research, we provide a large annotated corpus of timestamped news articles as well as the paraphrases produced by NEWSSPIKE.

1 Introduction

Paraphrasing, the task of finding sets of semantically equivalent surface forms, is crucial to many natural language processing applications, including relation extraction (Bhagat and Ravichandran, 2008), question answering (Fader et al., 2013), summarization (Barzilay et al., 1999) and machine translation (Callison-Burch et al., 2006). While the benefits of paraphrasing have been demonstrated, creating a large-scale corpus of high precision paraphrases remains a challenge, especially for event relations.

Many researchers have considered generating paraphrases by mining the Web guided by the distributional hypothesis, which states that words occurring in similar contexts tend to have similar meanings (Harris, 1954). For example, DIRT (Lin and Pantel, 2001) and Resolver (Yates and Etzioni, 2009) identify synonymous relation phrases by the distributions of their arguments. However, the distributional hypothesis has several drawbacks. First, it can confuse antonyms with synonyms because antonymous phrases appear in similar contexts as often as synonymous phrases. For the same reasons, it also often confuses causes with effects. For example, DIRT reports that the closest phrase to fall is rise, and the closest phrase to shoot is kill [1]. Second, the distributional hypothesis relies on statistics over large corpora to produce accurate similarity statistics. It remains unclear how to accurately paraphrase less frequent relations with the distributional hypothesis.

Another common approach uses parallel corpora. News articles are an interesting target, because there often exist articles from different sources describing the same daily events. This peculiar property allows the use of the temporal assumption, which assumes that phrases in articles published at the same time tend to have similar meanings. For example, the approaches by Dolan et al. (2004) and Barzilay et al. (2003) identify pairs of sentential paraphrases in similar articles that have appeared in the same period of time. While these approaches use temporal information as a coarse filter in the data generation stage, they still largely rely on text metrics in the prediction stage. This not only reduces precision, but also limits the discovery of paraphrases with dissimilar surface strings.

[1] http://demo.patrickpantel.com/demos/lexsem/paraphrase.htm

The goal of our research is to develop a technique to generate paraphrases for large numbers of event relations with high precision, using only minimal human effort. The key to our approach is a joint cluster model using the temporal attributes of news streams, which allows us to identify semantic equivalence of event relation phrases with greater precision. In summary, this paper makes the following contributions:

• We formulate a set of three temporal correspondence heuristics that characterize regularities over parallel news streams.

• We develop a novel program, NEWSSPIKE, based on a probabilistic graphical model that jointly encodes these heuristics. We present inference and learning algorithms for our model.

• We present a series of detailed experiments demonstrating that NEWSSPIKE outperforms several competitive baselines, and show through ablation tests how each of the temporal heuristics affects performance.

• To spur further research on this topic, we provide both our generated paraphrase clusters and a corpus of 0.5M time-stamped news articles [2], collected over a period of about 50 days from hundreds of news sources.

2 System Overview

The main goal of this work is to generate high precision paraphrases for relation phrases. News streams are a promising resource, since articles from different sources tend to use semantically equivalent phrases to describe the same daily events. For example, when a recent scandal hit, headlines read: “Armstrong steps down from Livestrong”; “Armstrong resigns from Livestrong” and “Armstrong cuts ties with Livestrong”. From these we can conclude that the following relation phrases are semantically similar: {step down from, resign from, cut ties with}.

To realize this intuition, our first challenge is to represent an event. In practice, a question like “What happened to Armstrong and Livestrong on Oct 17?” could often lead to a unique answer. It implies that using an argument pair and a time-stamp could be an effective way to identify an event (e.g. (Armstrong, Livestrong, Oct 17) for the previous question). Based on this observation, this paper introduces a novel mechanism to paraphrase relations as summarized in Figure 1.

[Figure 1 (pipeline diagram): news streams → OpenIE → shallow timestamped extractions (a1, r, a2, t) → group → extracted event candidates (EEC) (a1, a2, t) with their relation phrases → joint inference model, using temporal features and constraints derived from the temporal heuristics → paraphrase clusters, e.g. {r1, r3, r4}.]

Figure 1: NEWSSPIKE first applies open information extraction to articles in the news streams, obtaining shallow extractions with time-stamps. Next, an extracted event candidate (EEC) is obtained after grouping daily extractions by argument pairs. Temporal features and constraints are developed based on our temporal correspondence heuristics and encoded into a joint inference model. The model finally creates the paraphrase clusters by predicting the relation phrases that describe the EEC.

[2] https://www.cs.washington.edu/node/9473/

NEWSSPIKE first applies the ReVerb open information extraction (IE) system (Fader et al., 2011) on the news streams to obtain a set of (a1, r, a2, t) tuples, where the ai are the arguments, r is a relation phrase, and t is the time-stamp of the corresponding news article. When (a1, a2, t) suggests a real-world event, the relation r of (a1, r, a2, t) is likely to describe that event (e.g. (Armstrong, resign from, Livestrong, Oct 17)). We call every (a1, a2, t) an extracted event candidate (EEC), and every relation describing the event an event-mention.

For each EEC (a1, a2, t), suppose there are m extraction tuples (a1, r1, a2, t) ... (a1, rm, a2, t) sharing the values of a1, a2, and t. We refer to this set of extraction tuples as the EEC-set, and denote it (a1, a2, t, {r1 ... rm}). All the event-mentions in the EEC-set may be semantically equivalent and are hence candidates for a good paraphrase cluster.
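The grouping step described above can be illustrated with a minimal sketch. The function name `build_eec_sets` and the sample tuples are hypothetical; the sketch only assumes that extractions arrive as (a1, r, a2, t) tuples and that two extractions belong to the same EEC when a1, a2 and t match exactly.

```python
from collections import defaultdict


def build_eec_sets(extractions):
    """Group (a1, r, a2, t) tuples into EEC-sets keyed by (a1, a2, t)."""
    groups = defaultdict(list)
    for a1, r, a2, t in extractions:
        relations = groups[(a1, a2, t)]
        # Keep each relation phrase once, in first-seen order.
        if r not in relations:
            relations.append(r)
    return dict(groups)


# Illustrative input only; these tuples are not taken from the released corpus.
extractions = [
    ("Armstrong", "step down from", "Livestrong", "Oct 17"),
    ("Armstrong", "resign from", "Livestrong", "Oct 17"),
    ("Armstrong", "cut ties with", "Livestrong", "Oct 17"),
    ("Armstrong", "resign from", "Livestrong", "Oct 17"),
    ("Barack Obama", "head to", "the White House", "Oct 17"),
]
eec_sets = build_eec_sets(extractions)
```

Each value in `eec_sets` is the candidate set {r1 ... rm} for one EEC; deciding which of those relations actually describe the event is the prediction problem addressed in the rest of the paper.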

Thus, the paraphrasing problem becomes a prediction problem: for each relation ri in the EEC-set, does it or does it not describe the hypothesized event? We solve this problem in two steps. The next section proposes a set of temporal correspondence heuristics that partially characterize semantically equivalent EEC-sets. Then, in Section 4, we present a joint inference model designed to use these heuristics to solve the prediction problem and to generate paraphrase clusters.

3 Temporal Correspondence Heuristics

In this section, we propose a set of temporal heuristics that are useful for generating paraphrases at high precision. Our heuristics start from the basic observation mentioned previously: events can often be uniquely determined by their arguments and time. Additionally, we find that it is not just the publication time of the news story that matters; the verb tenses of the sentences are also important. For example, the two sentences “Armstrong was the chairman of Livestrong” and “Armstrong steps down from Livestrong” have past and present tense respectively, which suggests that the relation phrases are less likely to describe the same event and are thus not semantically equivalent. To capture these intuitions, we propose the Temporal Functionality Heuristic:

Temporal Functionality Heuristic. News articles published at the same time that mention the same entities and use the same tense tend to describe the same events.

Unfortunately, we find that not all the event candidates, (a1, a2, t), are equally good for paraphrasing. For example, today’s news might include both “Barack Obama heads to the White House” and “Barack Obama greets reporters at the White House”. Although the two sentences are highly similar, sharing a1 = “Barack Obama” and a2 = “White House,” and were published at the same time, they describe different events.

From a probabilistic point of view, we can treat each sentence as being generated by a particular hidden event which involves several actors. Clearly, some of these actors, like Obama, participate in many more events than others, and in such cases we observe sentences generated from a mixture of events. Since two event mentions from such a mixture are much less likely to denote the same event or relation, we wish to distinguish them from the better (semantically homogeneous) EECs like the (Armstrong, Livestrong) example. The question becomes: “How can one distinguish good entity pairs from bad?”

Our method rests on the simple observation that an entity which participates in many different events on one day is likely to have participated in events in recent days. Therefore we can judge whether an entity pair is good for paraphrasing by looking at the history of how frequently the entity pair is mentioned in the news streams, that is, the time series of that entity pair. The time series of the entity pair (Barack Obama, the White House) tends to be high over time, while the time series of the entity pair (Armstrong, Livestrong) is flat for a long time and suddenly spikes upwards on a single day. This observation leads to:

Temporal Burstiness Heuristic. If an entity or an entity pair appears significantly more frequently in one day’s news than in recent history, the corresponding event candidates are likely to be good for generating paraphrases.

The temporal burstiness heuristic implies that a good EEC (a1, a2, t) tends to have a spike in the time series of its entities ai, or argument pair (a1, a2), on day t.
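One way to turn this heuristic into a number is to compare the mention count on day t with the average over the preceding days. The sketch below is only an illustration: the paper does not specify its spike features at this level of detail, so the function `spike_ratio`, the 7-day window, the additive smoothing and the daily counts are all assumptions.

```python
def spike_ratio(daily_counts, day, window=7, smoothing=1.0):
    """Ratio of the count on `day` to the smoothed mean of the preceding days.

    `daily_counts` is a list of mention counts indexed by day. The window
    length and smoothing constant are illustrative choices, not values
    taken from the paper.
    """
    start = max(0, day - window)
    history = daily_counts[start:day]
    baseline = (sum(history) + smoothing) / (len(history) + smoothing)
    return daily_counts[day] / baseline


# Hypothetical time series: one pair is quiet and then spikes, the other
# is mentioned heavily every day.
armstrong_livestrong = [0, 0, 1, 0, 0, 0, 0, 12]
obama_white_house = [9, 11, 10, 12, 10, 9, 11, 12]

bursty = spike_ratio(armstrong_livestrong, day=7)
steady = spike_ratio(obama_white_house, day=7)
```

Under these assumptions the bursty pair receives a ratio far above 1 while the steadily mentioned pair stays close to 1, which is the contrast the heuristic relies on.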

However, even if we have selected a good EEC for paraphrasing, it is likely that it contains a few relation phrases that are related to (but not synonymous with) the other relations included in the EEC. For example, it’s likely that the news story reporting “Armstrong steps down from Livestrong.” might also mention “Armstrong is the founder of Livestrong.” and so both “steps down from” and “is the founder of” relation phrases would be part of the same EEC-set. Inspired by the idea of one sense per discourse from (Gale et al., 1992), we propose:

One Event-Mention Per Discourse Heuristic. A news article tends not to state the same fact more than once.

The one event-mention per discourse heuristic is proposed in order to gain precision at the expense of recall: the heuristic directs an algorithm to choose, from a news story, the single “best” relation phrase connecting a pair of entities. Of course, this doesn’t answer the question of deciding which phrase is “best.” In Section 4.3, we describe how to learn a probabilistic graphical model which does exactly this.


4 Exploiting the Temporal Heuristics

In this section we propose several models to capture the temporal correspondence heuristics, and discuss their pros and cons.

4.1 Baseline Model

An easy way to use an EEC-set is to simply predict that all ri in the EEC-set are event-mentions, and hence are semantically equivalent. That is, given EEC-set (a1, a2, t, {r1 ... rm}), the output cluster is {r1 ... rm}.

This baseline model captures most of the temporal functionality heuristic, except for the tense requirement. Our empirical study shows that it performs surprisingly well. This demonstrates that the quality of our input for the learning model is good: the EEC-sets are promising resources for paraphrasing.

Unfortunately, the baseline model cannot deal with the other heuristics, a problem we will remedy in the following sections.

4.2 Pairwise Model

The temporal functionality heuristic suggests we exploit the tenses of the relations in an EEC-set, while the temporal burstiness heuristic suggests we exploit the time series of its arguments. A pairwise model can be designed to capture them: we compare pairs of relations in the EEC-set, and predict whether each pair is synonymous or non-synonymous. Paraphrase clusters are then generated according to some heuristic rules (e.g. assuming transitivity among synonyms). The tenses of the relations and time series of the arguments are encoded as features, which we call tense features and spike features respectively. An example tense feature is whether one relation is past tense while the other relation is present tense; an example spike feature is the covariance of the time series.
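The transitivity rule for turning pairwise predictions into clusters can be sketched with a small union-find structure. This is an illustrative implementation, not the authors' code; `cluster_by_transitivity` and the placeholder relation names are hypothetical, and the pairwise classifier itself is assumed to have already produced the list of synonymous pairs.

```python
def cluster_by_transitivity(relations, synonymous_pairs):
    """Merge relations linked by predicted-synonymous pairs into clusters."""
    parent = {r: r for r in relations}

    def find(r):
        # Follow parent links to the representative, halving paths on the way.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in synonymous_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for r in relations:
        clusters.setdefault(find(r), set()).add(r)
    # A paraphrase cluster needs at least two phrases.
    return [c for c in clusters.values() if len(c) > 1]


# Placeholder relation names; (r1, r2) and (r1, r3) are predicted synonymous.
relations = ["r1", "r2", "r3", "r4"]
clusters = cluster_by_transitivity(relations, [("r1", "r2"), ("r1", "r3")])
```

Here r2 and r3 end up in the same cluster even though the pair (r2, r3) was never predicted directly, which is exactly the ad-hoc behaviour that the joint model of Section 4.3 replaces with explicit constraints.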

The pairwise model can be considered similar to paraphrasing techniques which examine two sentences and determine whether they are semantically equivalent (Dolan and Brockett, 2005; Socher et al., 2011). Unfortunately, these techniques are often based purely on text metrics and do not consider any temporal attributes. In Section 5, we evaluate the effect of applying these techniques.

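[Figure 2 (factor graph): the event variable $Z$ for (Armstrong, Livestrong, Oct 17) with its factor $\Phi^Z$; relation variables $Y$ for be founder of and step down (Article 1) and for give speech at, be chairman of and resign from (Article 2); relation factors $\Phi^Y_1$ and $\Phi^Y_2$; and joint factors $\Phi^{\text{joint}}$. Each variable is shown with an example 0/1 assignment.]

Figure 2: An example model for EEC (Armstrong, Livestrong, Oct 17). $Y$ and $Z$ are binary random variables. $\Phi^Y$, $\Phi^Z$ and $\Phi^{\text{joint}}$ are factors. be founder of and step down come from article 1 while give speech at, be chairman of and resign from come from article 2.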

4.3 Joint Cluster Model

The pairwise model has several drawbacks: 1) it lacks the ability to handle constraints, such as the mutual exclusion constraint implied by the one-mention per discourse heuristic; 2) ad-hoc rules, rather than formal optimizations, are required to generate clusters containing more than two relations.

A common approach to overcome the drawbacks of the pairwise model and to combine heuristics together is to introduce a joint cluster model, in which heuristics are encoded as features and constraints. Data, instead of ad-hoc rules, determines the relevance of different insights, which can be learned as parameters. The advantage of the joint model is analogous to that of cluster-based approaches for coreference resolution (CR). In particular, a joint model can better capture constraints on multiple variables and can yield higher quality results than pairwise CR models (Rahman and Ng, 2009).

We propose an undirected graphical model, NEWSSPIKE, which jointly clusters relations. Constraints are captured by factors connecting multiple random variables. We introduce the random variables, the factors, the objective function, the inference algorithm, and the learning algorithm in the following sections. Figure 2 shows an example model for EEC (Armstrong, Livestrong, Oct 17).

4.3.1 Random Variables

For the EEC-set $(a_1, a_2, t, \{r_1, \ldots, r_m\})$, we introduce one event variable and $m$ relation variables, all boolean valued. The event variable $Z^{(a_1, a_2, t)}$ indicates whether $(a_1, a_2, t)$ is a good event for paraphrasing. It is designed in accordance with the temporal burstiness heuristic: for the EEC (Barack Obama, the White House, Oct 17), $Z$ should be assigned the value 0.

The relation variable $Y^r$ indicates whether relation $r$ describes the EEC $(a_1, a_2, t)$ or not (i.e. $r$ is an event-mention or not). The set of all event-mentions with $Y^r = 1$ defines a paraphrase cluster containing relation phrases. For example, the assignments $Y^{\text{step down}} = Y^{\text{resign from}} = 1$ produce a paraphrase cluster {step down, resign from}.

4.3.2 Factors and the Joint Distribution

In this section, we introduce a conditional probability model defining a joint distribution over all of the event and relation variables. The joint distribution is a function over factors. Our model contains event factors, relation factors and joint factors.

The event factor $\Phi^Z$ is a log-linear function with spike features, used to distinguish good events. A relation factor $\Phi^Y$ is also a log-linear function. It can be defined for individual relation variables (e.g. $\Phi^Y_1$ in Figure 2) with features such as whether a relation phrase comes from a clausal complement [3]. A relation factor can also be defined for a pair of relation variables (e.g. $\Phi^Y_2$ in Figure 2) with features capturing the pairwise evidence for paraphrasing, such as whether two relation phrases have the same tense.

The joint factors $\Phi^{\text{joint}}$ are defined to apply constraints implied by the temporal heuristics. They play two roles in our model: 1) to satisfy the temporal burstiness heuristic, when the value of the event variable is false, the EEC is not appropriate for paraphrasing, and so all relation variables should also be false; and 2) to satisfy the one-mention per discourse heuristic, at most one relation variable from a single article could be true.

We define the joint distribution over these variables and factors as follows. Let $\mathbf{Y} = (Y^{r_1} \ldots Y^{r_m})$ be the vector of relation variables; let $\mathbf{x}$ be the features. The joint distribution is:

[3] Relation phrases in clausal complements are less useful for paraphrasing because they often do not describe a fact. For example, in the sentence He heard Romney had won the election, the extraction (Romney, had won, the election) is not a fact at all.

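$$p(Z = z, \mathbf{Y} = \mathbf{y} \mid \mathbf{x}; \Theta) \overset{\text{def}}{=} \frac{1}{Z_{\mathbf{x}}} \, \Phi^Z(z, \mathbf{x}) \times \prod_{d} \Phi^{\text{joint}}(z, \mathbf{y}_d, \mathbf{x}) \prod_{i,j} \Phi^Y(y_i, y_j, \mathbf{x})$$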

where $\mathbf{y}_d$ indicates the subset of relation variables from a particular article $d$, and the parameter vector $\Theta$ is the weight vector of the features in $\Phi^Z$ and $\Phi^Y$, which are log-linear functions; i.e.,

$$\Phi^Y(y_i, y_j, \mathbf{x}) \overset{\text{def}}{=} \exp\Big( \sum_{j} \theta_j \phi_j(y_i, y_j, \mathbf{x}) \Big)$$

where $\phi_j$ is the $j$th feature function.
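As a small numerical illustration of a log-linear factor, the sketch below exponentiates a weighted sum of feature values. The feature names and weights are made up for the example; only the functional form (the exponential of a weighted feature sum) comes from the definition above.

```python
import math


def log_linear_factor(weights, features):
    """Return exp(sum_j theta_j * phi_j) for one factor.

    `weights` maps feature names to theta_j and `features` maps feature
    names to phi_j values; features without a weight contribute 0.
    """
    total = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return math.exp(total)


# Hypothetical weights and feature values for a pair of relation phrases.
weights = {"same_tense": 0.7, "text_similarity": 1.2}
features = {"same_tense": 1.0, "text_similarity": 0.5}
factor_value = log_linear_factor(weights, features)
```

Because the factor is an exponential, taking its logarithm recovers the plain weighted sum, which is what makes the objective linear in the inference step of Section 4.3.3.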

The joint factors $\Phi^{\text{joint}}$ are used to apply the temporal burstiness heuristic and the one event-mention per discourse heuristic. $\Phi^{\text{joint}}$ is zero when the EEC is not good for paraphrasing, but some $y^r = 1$; or when there is more than one $r$ in a single article such that $y^r = 1$. Formally, it is calculated as:

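$$\Phi^{\text{joint}}(z, \mathbf{y}_d, \mathbf{x}) \overset{\text{def}}{=} \begin{cases} 0 & \text{if } z = 0 \wedge \exists\, y^r = 1 \\ 0 & \text{if } \sum_{y^r \in \mathbf{y}_d} y^r > 1 \\ 1 & \text{otherwise} \end{cases}$$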
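The case analysis of the joint factor translates directly into code. In this sketch the function name `phi_joint` is hypothetical and the feature argument x is omitted, since the definition above does not depend on it; `y_d` is assumed to be a list of 0/1 assignments for the relations of one article.

```python
def phi_joint(z, y_d):
    """Joint factor for one article d, following the three cases above."""
    if z == 0 and any(y == 1 for y in y_d):
        return 0  # poor EEC, yet some relation is marked as an event-mention
    if sum(y_d) > 1:
        return 0  # more than one event-mention selected from a single article
    return 1
```

A factor value of 0 gives the whole assignment probability 0, so the factor acts as a hard constraint rather than a soft preference.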

4.3.3 Maximum a Posteriori Inference

The goal of inference is to find the predictions $z, \mathbf{y}$ which yield the greatest probability, i.e.,

$$z^*, \mathbf{y}^* = \arg\max_{z, \mathbf{y}} \; p(Z = z, \mathbf{Y} = \mathbf{y} \mid \mathbf{x}; \Theta)$$

This can be viewed as a MAP inference problem. In general, inference in a graphical model is challenging. Fortunately, the joint factors in our model are linear, and the event and relation factors are log-linear; we can cast MAP inference as an integer linear programming (ILP) problem, and then compute an approximation in polynomial time by means of linear programming using randomized rounding, as proposed in (Yannakakis, 1992).

We build one ILP problem for every EEC. The variables of the ILP are $Z$ and $\mathbf{Y}$, which only take values of 0 or 1. The objective function is the sum of logs of the event and relation factors $\Phi^Z$ and $\Phi^Y$. The temporal burstiness heuristic of $\Phi^{\text{joint}}$ is encoded as a linear inequality constraint $z \geq y_i$; the one-mention per discourse heuristic of $\Phi^{\text{joint}}$ is encoded as the constraint

$$\sum_{y_i \in \mathbf{y}_d} y_i \leq 1.$$
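For a single small EEC, the same optimization problem can be solved by exhaustive enumeration, which makes the objective and the two constraints easy to inspect. The sketch below is not the LP relaxation with randomized rounding used in the paper; it is a brute-force stand-in that is exponential in the number of relations. The function `map_inference`, the simplification that log-factor scores are non-zero only for variables set to 1, and all numeric scores are assumptions made for illustration.

```python
from itertools import product


def map_inference(event_score, unary, pairwise, article_of):
    """Exhaustive MAP search for one EEC (exponential in len(unary)).

    event_score: log-factor contribution when z = 1 (taken as 0 when z = 0).
    unary[i]: contribution when y_i = 1.
    pairwise[(i, j)]: contribution when y_i = y_j = 1.
    article_of[i]: identifier of the article that relation i came from.
    """
    m = len(unary)
    best, best_score = (0, (0,) * m), 0.0  # the all-zero assignment scores 0
    for z in (0, 1):
        for y in product((0, 1), repeat=m):
            if any(yi > z for yi in y):
                continue  # violates z >= y_i
            chosen = [article_of[i] for i in range(m) if y[i]]
            if len(chosen) != len(set(chosen)):
                continue  # violates the at-most-one-per-article constraint
            score = z * event_score
            score += sum(unary[i] for i in range(m) if y[i])
            score += sum(w for (i, j), w in pairwise.items() if y[i] and y[j])
            if score > best_score:
                best, best_score = (z, y), score
    return best, best_score


# Relations in the order of Figure 2; every score below is invented.
relations = ["be founder of", "step down", "give speech at", "be chairman of", "resign from"]
article_of = [1, 1, 2, 2, 2]
unary = [-0.5, 0.8, -1.0, -0.3, 0.9]
pairwise = {(1, 4): 1.5, (0, 3): 0.4}
(z_star, y_star), score = map_inference(1.0, unary, pairwise, article_of)
cluster = [r for r, y in zip(relations, y_star) if y]
```

With these invented scores the search selects one relation from each article, step down and resign from, and sets the event variable to 1. An ILP solver would explore the same feasible set without enumerating it.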


4.3.4 Learning

Our training data consists of a set of $N = 500$ labeled EEC-sets of the form $\{(R_i, R_i^{\text{gold}})\}_{i=1}^{N}$. Each $R$ is the set of all relations in the EEC-set while $R^{\text{gold}}$ is a manually selected subset of $R$ containing relations describing the EEC. $R^{\text{gold}}$ could be empty if the EEC was deemed poor for paraphrasing. For our model, the gold assignment $y^r_{\text{gold}} = 1$ if $r \in R^{\text{gold}}$; the gold assignment $z_{\text{gold}} = 1$ if $R^{\text{gold}}$ is not empty.

Given $\{(R_i, R_i^{\text{gold}})\}_{i=1}^{N}$, learning over similar models is commonly done via maximum likelihood estimation as follows:

$$L(\Theta) = \log \prod_{i} p(Z_i = z_i^{\text{gold}}, \mathbf{Y}_i = \mathbf{y}_i^{\text{gold}} \mid \mathbf{x}_i, \Theta)$$

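For features in relation factors, the partial derivative for the $i$th model is:

$$\Phi_j(\mathbf{y}_i^{\text{gold}}, \mathbf{x}_i) - E_{p(z_i, \mathbf{y}_i \mid \mathbf{x}_i, \Theta)} \Phi_j(\mathbf{y}_i, \mathbf{x}_i)$$

where $\Phi_j(\mathbf{y}_i, \mathbf{x}_i) = \sum \phi_j(X, Y, \mathbf{x})$, the sum of values for the $j$th feature in the $i$th model; and values of $X, Y$ come from the assignment $\mathbf{y}_i$. For features in event factors, the partial derivative is derived similarly as

$$\phi_j(z_i^{\text{gold}}, \mathbf{x}_i) - E_{p(z_i, \mathbf{y}_i \mid \mathbf{x}_i, \Theta)} \phi_j(z_i, \mathbf{x}_i)$$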

It is unclear how to efficiently compute the expectations in the above formulas; a brute force approach requires enumerating all assignments of $\mathbf{y}_i$, whose number grows exponentially with the number of relations. Instead, we opt to use a more tractable perceptron learning approach (Collins, 2002; Hoffmann et al., 2011). Instead of computing the expectations, we simply compute $\phi_j(z_i^*, \mathbf{x}_i)$ and $\Phi_j(\mathbf{y}_i^*, \mathbf{x}_i)$, where $z_i^*, \mathbf{y}_i^*$ is the assignment with the highest probability, generated by the MAP inference algorithm using the current weight vector. The weight updates are the following:

$$\Phi_j(\mathbf{y}_i^{\text{gold}}, \mathbf{x}_i) - \Phi_j(\mathbf{y}_i^*, \mathbf{x}_i) \qquad (1)$$

$$\phi_j(z_i^{\text{gold}}, \mathbf{x}_i) - \phi_j(z_i^*, \mathbf{x}_i) \qquad (2)$$

The updates can be intuitively explained as penalties on errors. In sum, our learning algorithm consists of iterating the following two steps: (1) infer the most probable assignment given the current weights; (2) update the weights by comparing inferred assignments and the truth assignment.
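The weight update in step (2) can be sketched as follows. The function `perceptron_update`, the feature names, and the learning rate parameter are assumptions for illustration; the paper specifies only that the update is the difference between the feature totals of the gold assignment and of the inferred assignment.

```python
def perceptron_update(weights, gold_features, predicted_features, rate=1.0):
    """Return new weights after adding rate * (gold - predicted) feature totals.

    Both feature arguments map feature names to summed feature values for
    one training example. `rate` is an assumed step size.
    """
    updated = dict(weights)
    for name in set(gold_features) | set(predicted_features):
        delta = gold_features.get(name, 0.0) - predicted_features.get(name, 0.0)
        updated[name] = updated.get(name, 0.0) + rate * delta
    return updated


# Hypothetical feature totals for one EEC under the gold and MAP assignments.
weights = {"same_tense": 0.5}
gold = {"same_tense": 1.0, "spike": 1.0}
predicted = {"same_tense": 1.0, "clausal_complement": 1.0}
new_weights = perceptron_update(weights, gold, predicted)
```

Features that fire equally in both assignments leave their weights unchanged, so when the inferred assignment already matches the gold assignment the update is zero.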

5 Empirical Study

We first introduce the experimental setup for our empirical study, and then we attempt to answer two questions in Sections 5.2 and 5.3 respectively: First, does the NEWSSPIKE algorithm effectively exploit the proposed heuristics and outperform other approaches which also use news streams? Secondly, do the proposed temporal heuristics paraphrase relations with greater precision than the distributional hypothesis?

5.1 Experimental Setup

Since we were unable to find any elaborate time-stamped, parallel, news corpus, we collected data using the following procedure:

• Collect RSS news seeds, which contain the title, time-stamp, and abstract of the news items.

• Use these titles to query the Bing news search engine API and collect additional time-stamped news articles.

• Strip HTML tags from the news articles using Boilerpipe (Kohlschutter et al., 2010); keep only the title and first paragraph of each article.

• Extract shallow relation tuples using the OpenIE system (Fader et al., 2011).

We performed these steps every day from January 1 to February 22, 2013. In total, we collected 546,713 news articles, from which we obtained 2.6 million extractions containing 529 thousand unique relations.

We used several types of features for paraphrasing: 1) spike features obtained from time series; 2) tense features, such as whether two relation phrases are both in the present tense; 3) cause-effect features, such as whether two relation phrases often appear successively in the news articles; 4) text features, such as whether sentences are similar; 5) syntactic features, such as whether a relation phrase appears in a clausal complement; and 6) semantic features, such as whether a relation phrase contains negative words.

Text and semantic features are encoded using the relation factors of Section 4.3.2. For example, in Figure 2, the factor $\Phi^Y_2$ includes the textual similarity between the sentences containing the phrases “step down” and “be chairman of” respectively; it also includes the feature that the tense of “step down” (present) is different from the tense of “be chairman of” (past).

output     {go into, go to, speak, return, head to}
gold       {go into, go to, approach, head to}
gold_div   {go *, approach, head to}
P/R        precision = 3/5        recall = 3/4
P/R_div    precision_div = 2/4    recall_div = 2/3

Figure 3: An example pair of the output cluster and the gold cluster, and the corresponding precision and recall numbers.

5.2 Comparison with Methods using Parallel News Corpora

We evaluated NEWSSPIKE against other methods that also use time-stamped news. These include the models described in Section 4 and state-of-the-art paraphrasing techniques.

Human annotators created gold paraphrase clusters for 500 EEC-sets; note that some EEC-sets yield no gold cluster, since a cluster requires at least two synonymous phrases. Two annotators were shown a set of candidate relation phrases in context and asked to select a subset of these that described a shared event (if one existed). There was 98% phrase-level agreement. Precision and recall were computed by comparing an algorithm’s output clusters to the gold cluster of each EEC. We consider paraphrases with minor lexical diversity, e.g. (go to, go into), to be of lesser interest. Since counting these trivial paraphrases tends to exaggerate the performance of a system, we also report precision and recall on diverse clusters, i.e., those whose relation phrases all have different head verbs. Figure 3 illustrates these metrics with an example; note that under our diverse metrics, all phrases matching go * count as one when computing both precision and recall. We conduct 5-fold cross validation on our labeled dataset to get precision and recall numbers when the system requires training.
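The two metrics can be reproduced on the Figure 3 example with a short sketch. It assumes, as a simplification, that the head verb of a relation phrase is its first token, so that go into and go to collapse to a single item in the diverse condition; the helper names are hypothetical, and the functions expect non-empty clusters.

```python
def head_key(phrase):
    """Collapse a phrase to its head verb, assumed here to be the first token."""
    return phrase.split()[0]


def precision_recall(output, gold, diverse=False):
    """Precision and recall of an output cluster against a gold cluster."""
    output, gold = set(output), set(gold)
    if diverse:
        # Phrases sharing a head verb count as one item.
        output = {head_key(p) for p in output}
        gold = {head_key(p) for p in gold}
    correct = len(output & gold)
    return correct / len(output), correct / len(gold)


output_cluster = ["go into", "go to", "speak", "return", "head to"]
gold_cluster = ["go into", "go to", "approach", "head to"]

standard = precision_recall(output_cluster, gold_cluster)
diverse = precision_recall(output_cluster, gold_cluster, diverse=True)
```

On this example the sketch gives precision 3/5 and recall 3/4 in the standard condition, and precision 2/4 and recall 2/3 in the diverse condition, matching the figure.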

We compare NEWSSPIKE with the models in Section 4, and also with the state-of-the-art paraphrase extraction method:

Baseline: the model discussed in Section 4.1. This system does not need any training, and generates outputs with perfect recall.

Pairwise: the pairwise model discussed in Section 4.2 and using the same set of features as used by NEWSSPIKE. To generate output clusters, transitivity is assumed inside the EEC-set. For example, when the pairwise model predicts that (r1, r2) and (r1, r3) are both paraphrases, the resulting cluster is {r1, r2, r3}.

System       P/R             P/R diverse
             prec    rec     prec    rec
Baseline     0.67    1.00    0.53    1.00
Pairwise     0.90    0.60    0.81    0.37
Socher       0.81    0.35    0.68    0.29
NEWSSPIKE    0.92    0.55    0.87    0.31

Table 1: Comparison with methods using parallel news corpora

Socher: Socher et al. (2011) achieved the best results on the Dolan et al. (2004) dataset, and released their code and models. We used their off-the-shelf predictor to replace the classifier in our Pairwise model. Given sentential paraphrases, aligning relation phrases is natural, because OpenIE has already identified the relation phrases.

Table 1 shows precision and recall numbers. It is interesting that the basic model already obtains 0.67 precision overall and 0.53 in the diverse condition. This demonstrates that the EEC-sets generated from the news streams are a promising resource for paraphrasing. Socher’s method performs better, but not as well as Pairwise or NEWSSPIKE, especially in the diverse cases. This is probably due to the fact that Socher’s method is based purely on text metrics and does not consider any temporal attributes. Taking into account the features used by NEWSSPIKE, Pairwise significantly improves the precision, which demonstrates the power of our temporal correspondence heuristics. Our joint cluster model, NEWSSPIKE, which considers both temporal features and constraints, gets the best performance in both conditions.

We conducted ablation testing to evaluate how spike features and tense features, which are particularly relevant to the temporal aspects of news streams, can improve performance. Figure 4 compares the precision/recall curves for four systems in the diverse condition: (1) NEWSSPIKE; (2) w/oSpike: turning off all spike features; (3) w/oTense: turning off all features about tense; and (4) w/oDiscourse: turning off the one event-mention per discourse heuristic. There are some dips in the curves because they are drawn after sorting the predictions by the value of the corresponding ILP objective functions, which do not perfectly reflect prediction accuracy. However, it is clear that NEWSSPIKE produces greater precision over all ranges of recall.

[Figure 4 (plot): precision (y-axis, 0.6 to 1.0) against recall (x-axis, 0.1 to 0.4) for NewsSpike, w/oSpike, w/oTense and w/oDiscourse.]

Figure 4: Precision recall curves on hard, diverse cases for NewsSpike, w/oSpike, w/oTense and w/oDiscourse.

5.3 Comparison with Methods using the Distributional Hypothesis

We evaluated our model against methods based on the distributional hypothesis. We ran NEWSSPIKE over all EEC-sets except for the development set and compared to the following systems:

Resolver: Resolver (Yates and Etzioni, 2009) uses a set of extraction tuples in the form of (a1, r, a2) as the input and creates a set of relation clusters as the output paraphrases. Resolver also produces argument clusters, but this paper only evaluates relation clustering. We evaluated Resolver’s performance with an input of the 2.6 million extractions described in Section 5.1, using Resolver’s default parameters.

ResolverNYT: Since Resolver is supposed to perform better when given more accurate statistics from a larger corpus, we tried giving it more data. Specifically, we ran ReVerb on 1.8 million NY Times articles published between 1987 and 2007 to obtain 60 million extractions (Sandhaus, 2008). We ran Resolver on the union of this and our standard test set, but report performance only on clusters whose relations were seen in our news stream.

System            all               diverse
                  prec    #rels     prec    #rels
Resolver          0.78    129       0.65    57
ResolverNyt       0.64    1461      0.52    841
ResolverNytTop    0.83    207       0.72    79
Cosine            0.65    17        0.33    9
CosineNyt         0.56    73        0.46    59
NEWSSPIKE         0.93    24843     0.87    5574

Table 2: Comparison with methods using the distributional hypothesis

ResolverNytTop: Resolver is designed to achieve good performance on its top results. We thus ranked the ResolverNYT outputs by their scores and report the precision of the top 100 clusters.

Cosine: Cosine similarity is a basic metric for the distributional hypothesis. This system employs the same setup as Resolver in order to generate paraphrase clusters, except that Resolver’s similarity metric is replaced with the cosine. Each relation is represented by a vector of argument pairs. The similarity threshold to merge two clusters was 0.5.
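The similarity used by this baseline can be sketched as the cosine between sparse count vectors indexed by argument pairs. The representation as `Counter` objects, the function name `cosine`, and the sample argument pairs are assumptions for illustration; the clustering procedure borrowed from Resolver is not reproduced here.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical argument-pair counts for two relation phrases.
resign_from = Counter({("Armstrong", "Livestrong"): 1, ("Pope", "papacy"): 1})
step_down_from = Counter({("Armstrong", "Livestrong"): 1})

similarity = cosine(resign_from, step_down_from)
should_merge = similarity >= 0.5  # merge threshold stated above
```

Because the vectors record only which argument pairs a relation occurs with, antonymous phrases that share argument pairs receive high similarity as well, which is the weakness of distributional methods discussed in the introduction.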

CosineNYT: As for ResolverNYT, we ran CosineNYT with an extra 60 million extractions and reported the performance on relations seen in our news stream.

We measured the precision of each system by manually labeling all output if 100 or fewer clusters were generated (e.g. ResolverNytTop); otherwise, we sampled 100 randomly chosen clusters. Annotators first determined the meaning of every output cluster and then created a gold cluster by choosing the correct relations. The gold cluster could be empty if the output cluster was nonsensical. Unlike many papers that simply report recall on the most frequent relations, we evaluated the total number of returned relations in the output clusters. As in Section 5.2, we also report numbers for the case of lexically diverse relation phrases.

As can be seen in Table 2, NEWSSPIKE outperformed methods based on the distributional hypothesis. The performance of Cosine and CosineNyt was very low, suggesting that simple similarity metrics are insufficient for handling the paraphrasing problem, even when large-scale input is involved. Resolver and ResolverNyt employ an advanced similarity measurement and achieve better results. Surprisingly, however, Resolver achieves greater precision than ResolverNyt. It is possible that argument pairs from news streams spanning 20 years sometimes provide incorrect evidence for paraphrasing. For example, there were extractions like (the Rangers, be third in, the NHL) and (the Rangers, be fourth in, the NHL) from news in 2007 and 2003 respectively. Using these phrases, ResolverNyt produced the incorrect cluster {be third in, be fourth in}. NEWSSPIKE achieves greater precision than even the best results from ResolverNytTop, because NEWSSPIKE successfully captures the temporal heuristics, and does not confuse synonyms with antonyms, or causes with effects. NEWSSPIKE also returned an order of magnitude more relations than other methods.
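The Rangers example can be made concrete with a small sketch. It contrasts pooling relation phrases by argument pair alone with additionally keying on time. The two tuples and their years are taken from the text; grouping by calendar year and the function name `group_relations` are simplifying assumptions for illustration and are not the NEWSSPIKE model itself.

```python
from collections import defaultdict

# (arg1, relation phrase, arg2, year); both tuples and years are from the text.
extractions = [
    ("the Rangers", "be third in", "the NHL", 2007),
    ("the Rangers", "be fourth in", "the NHL", 2003),
]


def group_relations(tuples, use_time):
    """Collect the relation phrases that share a grouping key."""
    groups = defaultdict(set)
    for arg1, rel, arg2, year in tuples:
        key = (arg1, arg2, year) if use_time else (arg1, arg2)
        groups[key].add(rel)
    return dict(groups)


pooled = group_relations(extractions, use_time=False)
by_year = group_relations(extractions, use_time=True)

# Pooled over 20 years, both phrases look like evidence for one cluster.
print(pooled)
# Keyed on time, the two phrases never co-occur.
print(by_year)
```

Without the time key, the shared argument pair places both phrases in one candidate group; with it, each phrase stays in its own group, so no spurious paraphrase evidence arises.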

5.4 Discussion

Unlike some domain-specific clustering methods, we tested on all relation phrases extracted by OpenIE on the collected news streams. There are no restrictions on the types of relations. Output paraphrases cover a broad range, including politics, sports, entertainment, health, science, etc. There are 10 thousand nonempty clusters over 17 thousand distinct phrases with average size 2.4. Unlike methods based on distributional similarity, NEWSSPIKE correctly clusters infrequently appearing phrases.

Since we focus on high precision, it is not surprising that most clusters are of size 2 and 3. These high precision clusters can contribute substantially to generating larger paraphrase clusters. For example, one could devise a technique to merge smaller clusters together. The work presented here provides a foundation for future work to more closely examine these challenges.

While this paper gives promising results, there are still behaviors found in news streams that prove challenging. Many errors are due to the discourse context: the two sentences are synonymous in the given EEC-set, but the relation phrases are not paraphrases in general. For example, consider the following two sentences: “DA14 narrowly misses Earth” and “DA14 flies so close to Earth”. Statistical information from a large corpus would help handle such cases. Note that in this paper, in order to compare fairly with the distributional hypothesis, we purposely forced NEWSSPIKE not to rely on any distributional similarity. But NEWSSPIKE’s graphical model has the flexibility to incorporate any similarity metrics as features. Such a hybrid model has great potential to increase both precision and recall, which is one goal for future work.

6 Related Work

The vast majority of paraphrasing work falls into two categories: approaches based on the distributional hypothesis or those exploiting correspondences between parallel corpora (Androutsopoulos and Malakasiotis, 2010; Madnani and Dorr, 2010).

Using Distributional Similarity: Lin and Pantel’s (2001) DIRT employs mutual information statistics to compute the similarity between relations represented in dependency paths. Resolver (Yates and Etzioni, 2009) introduces a new similarity metric called the Extracted Shared Property (ESP) and uses a probabilistic model to merge ESP with surface string similarity.

Identifying the semantic equivalence of relation phrases is also called relation discovery or unsupervised semantic parsing. Often techniques don’t compute the similarity explicitly but rely implicitly on the distributional hypothesis. Poon and Domingos’ (2009) USP clusters relations represented with fragments of dependency trees by repeatedly merging relations having similar context. Yao et al. (2011; 2012) introduce generative models for relation discovery using an LDA-style algorithm over a relation-feature matrix. Chen et al. (2011) focus on domain-dependent relation discovery, extending a generative model with meta-constraints from lexical, syntactic and discourse regularities.

Our work solves a major problem with these approaches, avoiding errors such as confusing synonyms with antonyms and causes with effects. Furthermore, NEWSSPIKE doesn’t require massive statistical evidence as do most approaches based on the distributional hypothesis.

Using Parallel Corpora: Comparable and parallel corpora, including news streams and multiple translations of the same story, have been used to generate paraphrases, both sentential (Barzilay and Lee, 2003; Dolan et al., 2004; Shinyama and Sekine, 2003) and phrasal (Barzilay and McKeown, 2001; Shen et al., 2006; Pang et al., 2003). Typical methods first gather relevant articles and then pair sentences that are potential paraphrases. Given a training set of paraphrases, models are learned and applied to unlabeled pairs (Dolan and Brockett, 2005; Socher et al., 2011). Phrasal paraphrases are often obtained by running an alignment algorithm over the paraphrased sentence pairs.

While prior work uses the temporal aspects of news streams as a coarse filter, it largely relies on text metrics, such as context similarity and edit distance, to make predictions and alignments. These metrics are usually insufficient to produce high precision results; moreover, they tend to produce paraphrases that are simple lexical variants (e.g., {go to, go into}). In contrast, NEWSSPIKE generates paraphrase clusters with both high precision and high diversity.

Others: Textual entailment (Dagan et al., 2009), which finds a phrase implying another phrase, is closely related to the paraphrasing task. Berant et al. (2011) note the flaws in distributional similarity and propose local entailment classifiers, which are able to combine many features. Lin et al. (2012) also use temporal information to detect the semantics of entities. In a manner similar to our approach, Recasens et al. (2013) mine parallel news stories to find opaque coreferent mentions.

7 Conclusion

Paraphrasing event relations is crucial to many natural language processing applications, including relation extraction, question answering, summarization, and machine translation. Unfortunately, previous approaches based on distributional similarity and parallel corpora often produce low precision clusters. This paper introduces three Temporal Correspondence Heuristics that characterize semantically equivalent phrases in news streams. We present a novel algorithm, NEWSSPIKE, based on a probabilistic graphical model encoding these heuristics, which harvests high-quality paraphrases of event relations.

Experiments show NEWSSPIKE’s improvement relative to several other methods, especially at producing lexically diverse clusters. Ablation tests confirm that our temporal features are crucial to NEWSSPIKE’s precision. In order to spur future research, we are releasing an annotated corpus of time-stamped news articles and our harvested relation clusters.

Acknowledgments

We thank Oren Etzioni, Anthony Fader, Raphael Hoffmann, Ben Taskar, Luke Zettlemoyer, and the anonymous reviewers for providing valuable advice. We also thank Shengliang Xu for annotating the datasets. We gratefully acknowledge the support of the Defense Advanced Research Projects Agency (DARPA) Machine Reading Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-09-C-0181, ONR grant N00014-12-1-0211, a gift from Google, and the WRF / TJ Cable Professorship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the view of DARPA, AFRL, or the US government.

References

Ion Androutsopoulos and Prodromos Malakasiotis. 2010. A survey of paraphrasing and textual entailment methods. In Journal of Artificial Intelligence Research, pages 135–187.

Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In HLT-NAACL, pages 16–23. Association for Computational Linguistics.

Regina Barzilay and Kathleen R McKeown. 2001. Extracting paraphrases from a parallel corpus. In ACL, pages 50–57. Association for Computational Linguistics.

Regina Barzilay, Kathleen R McKeown, and Michael Elhadad. 1999. Information fusion in the context of multi-document summarization. In ACL, pages 550–557. Association for Computational Linguistics.

Jonathan Berant, Ido Dagan, and Jacob Goldberger. 2011. Global learning of typed entailment rules. In ACL-HLT, pages 610–619. Association for Computational Linguistics.

Rahul Bhagat and Deepak Ravichandran. 2008. Large scale acquisition of paraphrases for learning surface patterns. In ACL, volume 8, pages 674–682. Association for Computational Linguistics.

Chris Callison-Burch, Philipp Koehn, and Miles Osborne. 2006. Improved statistical machine translation using paraphrases. In NAACL, pages 17–24. Association for Computational Linguistics.

Harr Chen, Edward Benson, Tahira Naseem, and Regina Barzilay. 2011. In-domain relation discovery with meta-constraints via posterior regularization. In ACL-HLT, pages 530–540. Association for Computational Linguistics.

Michael Collins. 2002. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In ACL, pages 1–8. Association for Computational Linguistics.

Ido Dagan, Bill Dolan, Bernardo Magnini, and Dan Roth. 2009. Recognizing textual entailment: Rational, evaluation and approaches. Natural Language Engineering, 15(04):i–xvii.

William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of IWP.

Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Computational Linguistics, page 350. Association for Computational Linguistics.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In EMNLP. Association for Computational Linguistics, July 27-31.

Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2013. Paraphrase-driven learning for open question answering. In ACL. Association for Computational Linguistics.

William A Gale, Kenneth W Church, and David Yarowsky. 1992. One sense per discourse. In Proceedings of the workshop on Speech and Natural Language, pages 233–237. Association for Computational Linguistics.

Zellig S Harris. 1954. Distributional structure. Word.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In ACL-HLT, pages 541–550.

Christian Kohlschutter, Peter Fankhauser, and Wolfgang Nejdl. 2010. Boilerplate detection using shallow text features. In WSDM, pages 441–450. ACM.

Dekang Lin and Patrick Pantel. 2001. Discovery of inference rules for question-answering. Natural Language Engineering, 7(4):343–360.

Thomas Lin, Oren Etzioni, et al. 2012. No noun phrase left behind: detecting and typing unlinkable entities. In EMNLP, pages 893–903. Association for Computational Linguistics.

Nitin Madnani and Bonnie J Dorr. 2010. Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36(3):341–387.

Bo Pang, Kevin Knight, and Daniel Marcu. 2003. Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. In NAACL, pages 102–109. Association for Computational Linguistics.

Hoifung Poon and Pedro Domingos. 2009. Unsupervised semantic parsing. In EMNLP, pages 1–10. Association for Computational Linguistics.

Altaf Rahman and Vincent Ng. 2009. Supervised models for coreference resolution. In EMNLP, pages 968–977. Association for Computational Linguistics.

Marta Recasens, Matthew Can, and Dan Jurafsky. 2013. Same referent, different words: Unsupervised mining of opaque coreferent mentions. In Proceedings of NAACL-HLT, pages 897–906.

Evan Sandhaus. 2008. The New York Times annotated corpus. Linguistic Data Consortium.

Siwei Shen, Dragomir R Radev, Agam Patel, and Gunes Erkan. 2006. Adding syntax to dynamic programming for aligning comparable texts for the generation of paraphrases. In Proceedings of the COLING/ACL on Main conference poster sessions, pages 747–754. Association for Computational Linguistics.

Yusuke Shinyama and Satoshi Sekine. 2003. Paraphrase acquisition for information extraction. In Proceedings of the second international workshop on Paraphrasing-Volume 16, pages 65–71. Association for Computational Linguistics.

Richard Socher, Eric H Huang, Jeffrey Pennington, Andrew Y Ng, and Christopher D Manning. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. NIPS, 24:801–809.

Mihalis Yannakakis. 1992. On the approximation of maximum satisfiability. In Proceedings of the third annual ACM-SIAM symposium on Discrete algorithms, SODA ’92, pages 1–9.

Limin Yao, Aria Haghighi, Sebastian Riedel, and Andrew McCallum. 2011. Structured relation discovery using generative models. In EMNLP, pages 1456–1466. Association for Computational Linguistics.

Limin Yao, Sebastian Riedel, and Andrew McCallum. 2012. Unsupervised relation discovery with sense disambiguation. In ACL, pages 712–720. Association for Computational Linguistics.

Alexander Yates and Oren Etzioni. 2009. Unsupervised methods for determining object and relation synonyms on the web. Journal of Artificial Intelligence Research, 34(1):255.