

Proceedings of NAACL-HLT 2019, pages 785–795, Minneapolis, Minnesota, June 2 - June 7, 2019. ©2019 Association for Computational Linguistics


Improving Event Coreference Resolution by Learning Argument Compatibility from Unlabeled Data

Yin Jou Huang 1, Jing Lu 2, Sadao Kurohashi 1, Vincent Ng 2
1 Graduate School of Informatics, Kyoto University

2 Human Language Technology Research Institute, University of Texas at Dallas
[email protected], [email protected]

    {ljwinnie,vince}@hlt.utdallas.edu

    Abstract

Argument compatibility is a linguistic condition that is frequently incorporated into modern event coreference resolution systems. If two event mentions have incompatible arguments in any of the argument roles, they cannot be coreferent. On the other hand, if these mentions have compatible arguments, then this may be used as information toward deciding their coreferent status. One of the key challenges in leveraging argument compatibility lies in the paucity of labeled data. In this work, we propose a transfer learning framework for event coreference resolution that utilizes a large amount of unlabeled data to learn the argument compatibility between two event mentions. In addition, we adopt an interactive inference network based model to better capture the (in)compatible relations between the context words of two event mentions. Our experiments on the KBP 2017 English dataset confirm the effectiveness of our model in learning argument compatibility, which in turn improves the performance of the overall event coreference model.

    1 Introduction

Events are essential building blocks of all kinds of natural language text. An event can be described several times from different aspects in the same document, resulting in multiple surface forms of event mentions. The goal of event coreference resolution is to identify event mentions that correspond to the same real-world event. This task is critical for natural language processing applications that require deep text understanding, such as storyline extraction/generation, text summarization, question answering, and information extraction.

Figure 1: A document with three events described in six event mentions. Coreferent event mentions are highlighted with the same color.

Figure 1 shows a document consisting of three events described by six different event mentions. Among these event mentions, m1, m2 and m4 are coreferent, since they all correspond to the event of the KMT party electing a new party chief. Similarly, m3 and m5 are also coreferent, while m6 is not coreferent with any other event mention.

An event mention consists of a trigger and zero or more arguments. The trigger of an event mention is the word/phrase that is considered the most representative of the event, such as the word meeting for m3 or the word elected for m6. Triggers of coreferent event mentions must be related, that is, they should describe the same type of events. For example, m1 and m3 cannot be coreferent, since their trigger words — elect and meeting — are not related.

Arguments are the participants of an event, each having its role. For example, KMT is the AGENT-argument and new party chief is the PATIENT-argument of m1. Argument compatibility is an important linguistic condition for determining the coreferent status between two event mentions. Two arguments are incompatible if they do not correspond to the same real-world entity when they are expressed in the same level of specificity; otherwise, they are compatible. For example, a pair of TIME-arguments — Wednesday and 2005 — which are expressed in different levels of specificity, are considered compatible. If two event mentions have incompatible arguments in some specific argument roles, they cannot be coreferent. For example, m2 and m6 are not coreferent since their TIME-arguments — January 2012 and 2005 — and their PATIENT-arguments — a new chairperson and Ma — are incompatible. On the other hand, coreferent event mentions can only have compatible arguments. For example, m3 and m5 both have Wednesday as TIME-arguments. In this example, argument compatibility in the TIME argument role is a strong hint suggesting their coreference.

Figure 2: System overview.

Despite its importance, incorporating argument compatibility into event coreference systems is challenging due to the lack of sufficient labeled data. Many existing works have relied on implementing argument extractors as upstream components and designing argument features that capture argument compatibility in event coreference resolvers. However, the error introduced in each of these steps propagates through the resolvers and hinders their performance considerably.

In light of the aforementioned challenge, we propose a framework for transferring argument (in)compatibility knowledge to the event coreference resolution system, specifically by adopting the interactive inference network (Gong et al., 2018) as our model structure. The idea is as follows. First, we train a network to determine whether the corresponding arguments of an event mention pair are compatible on automatically labeled training instances collected from a large unlabeled news corpus. Second, to transfer the knowledge of argument (in)compatibility to an event coreference resolver, we employ the network (pre)trained in the previous step as a starting point and train it to determine whether two event mentions are coreferent on manually labeled event coreference corpora. Third, we iteratively repeat the above two steps, where we use the learned coreference model to relabel the argument compatibility instances, retrain the network to determine argument compatibility, and use the resulting pretrained network to learn an event coreference resolver. In essence, we mutually bootstrap the argument (in)compatibility determination task and the event coreference resolution task.

Our contributions are three-fold. First, we leverage argument (in)compatibility knowledge acquired from a large unlabeled corpus for event coreference resolution. Second, we employ the interactive inference network as our model structure to iteratively learn argument compatibility and event coreference resolution. Initially proposed for the task of natural language inference, the interactive inference network is suitable for capturing the semantic relations between word pairs. Experimental results on the KBP coreference dataset show that this network architecture is also suitable for capturing the argument compatibility between event mentions. Third, our model achieves state-of-the-art results on the KBP 2017 English dataset (Ellis et al., 2015, 2016; Getman et al., 2017), which confirms the effectiveness of our method.

    2 Related Work

Ablation experiments conducted by Chen and Ng (2013) provide empirical support for the usefulness of event arguments for event coreference resolution. Hence, it should not be surprising that, with just a few exceptions (e.g., Sangeetha and Arock (2012); Araki and Mitamura (2015); Lu and Ng (2017)), argument features have been extensively exploited in event coreference systems to capture the argument compatibility between two event mentions. Basic features, such as the number of overlapping arguments, the number of unique arguments, and a binary feature encoding whether arguments are conflicting, have been proposed (Chen et al., 2009; Chen and Ji, 2009; Chen and Ng, 2016). More sophisticated features based on different kinds of similarity measures have also been considered, such as surface similarity based on the Dice coefficient and the Wu-Palmer WordNet similarity between argument heads (McConky et al., 2012; Cybulska and Vossen, 2013; Araki et al., 2014; Liu et al., 2014; Krause et al., 2016). However, these features are computed using either the outputs of event argument extractors and entity coreference resolvers (Ahn, 2006; Chen and Ng, 2014, 2015; Lu and Ng, 2016) or semantic parsers (Bejan and Harabagiu, 2014; Yang et al., 2015; Peng et al., 2016), and therefore suffer from serious error propagation issues (see Lu and Ng (2018)). Several previous works proposed joint models to address this problem (Lee et al., 2012; Lu et al., 2016), while others utilized iterative methods to propagate argument information (Liu et al., 2014; Choubey and Huang, 2017). However, all of these methods still rely on argument extractors to identify arguments and their roles.

  event mention                                                     DATE-compatibility with ma
  ma: The result of the election last October surprised everyone.   -
  m1: He was elected as president in 2005.                          no
  m2: The presidential election took place on October 20th.         yes
  m3: The opposition party won the election.                        yes

Table 1: Examples of NER-based sample filtering. The phrases tagged as DATE are underlined, and the trigger words are boldfaced.

    3 Method

Our proposed transfer learning framework consists of two learning stages: the pretraining stage of an argument compatibility classifier and the fine-tuning stage of an event coreference resolver (Figure 2). We provide the details of both stages in sections 3.1 and 3.2, and describe the iterative strategy combining the two training stages in section 3.3. Details on the model structure are covered in section 3.4.

    3.1 Argument Compatibility Learning

In the pretraining stage, we train the model as an argument compatibility classifier with event mentions extracted from a large unlabeled news corpus.

Task definition: Given a pair of event mentions (ma, mb) with related triggers, predict whether their arguments are compatible or not.

Here, an event mention is represented by a trigger word and the context words within an n-word window around the trigger.

Related trigger extraction: We analyze the event coreference resolution corpus and extract trigger pairs that are coreferent more than k times in the training data. We define these trigger pairs to be related triggers in our experiment. In this work, we set k to 10. Table 2 shows some examples of related triggers with high counts; a sketch of this mining step follows the table.

  trigger pair               count
  kill - death                86
  shoot - shooting            35
  retire - retire             34
  demonstration - protest     30

Table 2: Examples of related triggers.
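To make this concrete, the following is a minimal sketch of the mining step, assuming gold coreference chains are available as lists of trigger strings; the function name and data layout are ours, for illustration only:

```python
from collections import Counter
from itertools import combinations

def mine_related_triggers(coref_chains, k=10):
    """Collect trigger pairs that are coreferent more than k times in the training data.

    coref_chains: iterable of gold chains, each a list of trigger strings
    belonging to one coreference cluster (hypothetical layout).
    """
    counts = Counter()
    for chain in coref_chains:
        # Every two mentions in the same chain form a coreferent trigger pair.
        for t1, t2 in combinations(chain, 2):
            counts[tuple(sorted((t1.lower(), t2.lower())))] += 1
    return {pair for pair, c in counts.items() if c > k}

# Toy usage: with k=1, only the pair seen twice survives.
chains = [["kill", "death", "kill"], ["shoot", "shooting"]]
print(mine_related_triggers(chains, k=1))  # {('death', 'kill')}
```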

If the triggers of an event mention pair are related, their coreferent status cannot be determined by looking at the triggers alone, and this is the case in which argument compatibility affects the coreferent status most directly. Thus, we focus on event mention pairs with related triggers in the pretraining stage of argument compatibility learning.

Compatible sample extraction: From each document, we extract event mention pairs with related triggers and check whether the following conditions are satisfied:

1. DATE-compatibility (Table 1): First, we perform named entity recognition (NER) on the context words. If both event mentions have phrases tagged as DATE in the context, these two phrases must contain at least one overlapping word. If there are multiple phrases tagged as DATE in the context, only the phrase closest to the trigger word is considered.

    2. PERSON-compatibility: Similar to 1.

    3. NUMBER-compatibility: Similar to 1.

    4. LOCATION-compatibility: Similar to 1.

5. Apart from function words, the ratio of overlapping words in their contexts must be under 0.3 for both event mentions. We add this constraint in order to remove trivial samples of nearly identical sentences.

Conditions 1–4 are heuristic filtering rules based on NER tags, which aim to remove samples with apparent incompatibilities. Here, we consider four NER types — DATE, PERSON, NUMBER, and LOCATION — because these types of words cover the most salient types of incompatibility that can be observed between event mentions. Condition 5 aims to remove event mention pairs that are “too similar”. We add this condition because we do not want our model to base its decisions on the number of overlapping words between the event mentions.

We collect event mention pairs satisfying all the above conditions as our initial set of compatible samples.
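As an illustration, here is a minimal sketch of the DATE-compatibility check (condition 1) and the overlap constraint (condition 5). The data layout, helper names, and the treatment of mentions lacking a DATE phrase are our own assumptions, not part of the paper's implementation:

```python
def nearest_phrase(phrases, trigger_idx):
    # Among NER phrases given as (start index, token list) pairs,
    # keep only the phrase closest to the trigger word.
    return min(phrases, key=lambda p: abs(p[0] - trigger_idx), default=None)

def date_compatible(mention_a, mention_b):
    # Condition 1: the DATE phrases nearest to the two triggers must share a word.
    da = nearest_phrase(mention_a["DATE"], mention_a["trigger_idx"])
    db = nearest_phrase(mention_b["DATE"], mention_b["trigger_idx"])
    if da is None or db is None:
        return True  # a mention without a DATE phrase provides no counter-evidence
    return len(set(da[1]) & set(db[1])) > 0

def low_overlap(ctx_a, ctx_b, function_words, max_ratio=0.3):
    # Condition 5: apart from function words, the overlap ratio must stay
    # under 0.3 for both mentions, to filter out nearly identical sentences.
    a = [w for w in ctx_a if w not in function_words]
    b = [w for w in ctx_b if w not in function_words]
    shared = set(a) & set(b)
    ratio_a = sum(w in shared for w in a) / max(len(a), 1)
    ratio_b = sum(w in shared for w in b) / max(len(b), 1)
    return ratio_a < max_ratio and ratio_b < max_ratio
```

The PERSON, NUMBER, and LOCATION checks (conditions 2–4) would follow the same pattern as date_compatible with a different NER type.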

Incompatible sample extraction: From different documents in the corpus, we extract event mention pairs with related triggers and check whether the following conditions are satisfied:

1. The creation dates of the two documents must be at least one month apart.

2. Apart from the trigger words and the function words, the contexts of the event mentions must contain at least one overlapping word.

In the unlabeled news corpus, articles describing similar news events are sometimes present. Thus, we use condition 1 to roughly ensure that the extracted event mention pairs are not coreferent. Mention pairs extracted from the same document tend to contain overlapping content words, so to prevent our model from basing its decisions on the existence of overlapping words, we add condition 2 as a constraint.

We collect event mention pairs satisfying all the above conditions as our initial set of incompatible samples.
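A corresponding sketch of this cross-document filter, under the same hypothetical data layout (document creation dates as datetime.date objects):

```python
from datetime import timedelta

def incompatible_candidate(date_a, date_b, ctx_a, ctx_b, triggers, function_words):
    """Check the two conditions for initial incompatible samples."""
    # Condition 1: document creation dates at least one month apart
    # (approximated here as 30 days).
    if abs(date_a - date_b) < timedelta(days=30):
        return False
    # Condition 2: at least one shared word besides triggers and function words.
    stop = set(triggers) | set(function_words)
    content_a = {w for w in ctx_a if w not in stop}
    content_b = {w for w in ctx_b if w not in stop}
    return len(content_a & content_b) >= 1
```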

Argument compatibility classifier: With the initial sets of compatible and incompatible samples acquired above, we train a binary classifier to distinguish between samples of the two sets.

    3.2 Event Coreference Learning

In the fine-tuning stage, we adapt the argument compatibility classifier into a mention-pair event coreference model using the labeled event coreference data.

3.2.1 Event Mention Detection

Before proceeding to the task of event coreference resolution, we have to identify the event mentions in the documents. We train a separate event mention detection model to identify event mentions along with their subtypes.

We model event mention detection as a multi-class classification problem. Given a candidate word along with its context, we predict the subtype of the event mention triggered by the word. If the given candidate word is not a trigger, we label it as NULL. We select the words that have appeared as a trigger at least once in the training data as candidate trigger words. We do not consider multi-word triggers in this work.

Given an input sentence, we first represent each of its constituent words by the concatenation of the word embedding and the character embedding of the word. These representation vectors are fed into a bidirectional LSTM (biLSTM) layer to obtain the hidden representation of each word.

For each candidate word in the sentence, its hidden representation is fed into the inference layer to predict the class label. Since the class distribution is highly unbalanced, with the NULL label significantly outnumbering all the other labels, we use a weighted softmax at the inference layer to obtain the probability of each class. In this work, we set the weight to 0.1 for the NULL class label and 1 for all the other class labels.
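One common way to realize this weighted softmax is a class-weighted cross-entropy loss; a PyTorch sketch under the paper's stated weights (the class inventory size of 19, i.e., 18 subtypes plus NULL, the NULL index, and the hidden size are our assumptions):

```python
import torch
import torch.nn as nn

NUM_CLASSES = 19   # 18 event subtypes + NULL at index 0 (assumed indexing)
HIDDEN_SIZE = 400  # illustrative biLSTM output size

# Weight 0.1 for the NULL label and 1 for all the other labels, as in the paper.
class_weights = torch.ones(NUM_CLASSES)
class_weights[0] = 0.1
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

# hidden: biLSTM states of a batch of candidate words; labels: gold subtypes.
inference = nn.Linear(HIDDEN_SIZE, NUM_CLASSES)
hidden = torch.randn(4, HIDDEN_SIZE)
labels = torch.tensor([0, 3, 0, 7])
loss = loss_fn(inference(hidden), labels)
```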

Intuitively, candidate triggers with the same surface form in the same document tend to have the same class label. However, it is difficult to model this consistency since our model operates at the sentence level. Thus, we account for this consistency across sentences with the following post-processing step: if a candidate word is assigned the NULL label but more than half of the candidates sharing the same surface form are detected as triggers of a specific subtype, then we change the label to this subtype. Also, we disregard event mentions with types contact, movement and transaction in this post-processing step, since the subtypes under these three types do not have good consistency across different sentences in the same document.
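A minimal sketch of this majority-vote post-processing rule, with a hypothetical data layout where each candidate is a dict carrying its surface form and predicted label:

```python
from collections import Counter

# Subtypes of these three types are exempt from relabeling (see above).
SKIPPED_TYPES = {"contact", "movement", "transaction"}

def enforce_document_consistency(candidates, subtype_to_type):
    """Relabel NULL candidates whose surface form is mostly detected as one subtype."""
    by_surface = {}
    for c in candidates:
        by_surface.setdefault(c["surface"].lower(), []).append(c)
    for group in by_surface.values():
        votes = Counter(c["label"] for c in group if c["label"] != "NULL")
        if not votes:
            continue
        subtype, n = votes.most_common(1)[0]
        # More than half of the same-surface candidates carry this subtype.
        if n > len(group) / 2 and subtype_to_type.get(subtype) not in SKIPPED_TYPES:
            for c in group:
                if c["label"] == "NULL":
                    c["label"] = subtype
    return candidates
```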

    3.2.2 Mention-Pair Event Coreference Model

With the argument compatibility classifier trained in the previous stage, we use the labeled event coreference corpus to fine-tune the model into an event coreference resolver. We design the event coreference resolver to be a mention-pair model (Soon et al., 2001), which takes a pair of event mentions as input and outputs the likelihood of them being coreferent.

With the pairwise event coreference predictions, we further conduct best-first clustering (Ng and Cardie, 2002) on the pairwise results to build the event coreference clusters of each document. Best-first clustering is an agglomerative clustering algorithm that links each event mention to the antecedent event mention with the highest coreference likelihood, provided that the likelihood is above an empirically determined threshold.
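A sketch of best-first clustering over the pairwise predictions, assuming mentions are indexed in document order and score(i, j) returns the coreference likelihood of mention j with candidate antecedent i; the threshold value here is illustrative, as the paper tunes it empirically:

```python
def best_first_clustering(num_mentions, score, threshold=0.5):
    """Link each mention to its highest-scoring antecedent above the threshold."""
    parent = list(range(num_mentions))  # start with singleton clusters

    def find(i):  # follow links to the cluster representative
        while parent[i] != i:
            i = parent[i]
        return i

    for j in range(1, num_mentions):
        best_i, best_s = None, threshold
        for i in range(j):  # consider every antecedent of mention j
            s = score(i, j)
            if s > best_s:
                best_i, best_s = i, s
        if best_i is not None:
            parent[find(j)] = find(best_i)  # merge j into the best antecedent's cluster
    return [find(i) for i in range(num_mentions)]
```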

    3.3 Iterative Relabeling Strategy

Previously, we collected a set of compatible event mention pairs from the same document with simple heuristic filtering. Despite this filtering step, the initial compatible set is noisy. Here, we introduce an iterative relabeling strategy to improve the quality of the compatible set of event mentions.

First, we calculate the coreference likelihood of the event mention pairs in the initial compatible set. Mention pairs with a coreference likelihood above the threshold θM are added to the new compatible set. On the other hand, mention pairs with a coreference likelihood below θm are added to the initial incompatible set to form the new incompatible set. With the new compatible and incompatible sets, we can start another iteration of transfer learning to train a coreference resolver of improved quality. In this work, we set θM to 0.8 and θm to 0.2.
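A self-contained sketch of one relabeling round, where coref_likelihood stands for the fine-tuned resolver's pairwise scoring function:

```python
THETA_M, THETA_m = 0.8, 0.2  # relabeling thresholds used in the paper

def relabel(compatible, incompatible, coref_likelihood):
    """Filter the compatible set and grow the incompatible set."""
    new_compatible, demoted = [], []
    for pair in compatible:
        p = coref_likelihood(pair)
        if p > THETA_M:
            new_compatible.append(pair)  # confidently coreferent: keep as compatible
        elif p < THETA_m:
            demoted.append(pair)         # confidently non-coreferent: demote
        # pairs with THETA_m <= p <= THETA_M end up in neither new set
    return new_compatible, incompatible + demoted

# Toy usage with a stub scoring function:
scores = {"p1": 0.9, "p2": 0.5, "p3": 0.1}
new_comp, new_incomp = relabel(["p1", "p2", "p3"], [], scores.get)
print(new_comp, new_incomp)  # ['p1'] ['p3']
```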

3.4 Model Structure

We adopt an interactive inference network as the model structure of our proposed method (Figure 3). A qualitative analysis of the interactive inference network shows that it is good at capturing word overlaps, antonyms and paraphrases between sentence pairs (Gong et al., 2018). Thus, we believe this network is suitable for capturing the argument compatibility between two event mentions. The model consists of the following components:

Model inputs: The input to the model is a pair of event mentions (m_a, m_b), with m_a being the antecedent mention of m_b:

m_a = {w_a^1, w_a^2, ..., w_a^N}
m_b = {w_b^1, w_b^2, ..., w_b^N}    (1)

Each event mention is represented by a sequence of N tokens consisting of one trigger word and its context. Here, we take the context to be the words within an n-word window around the trigger. In this work, n is set to 10.
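A small sketch of this input construction, assuming a tokenized sentence, the trigger's index, and a fixed mention length of N = 2n + 1 tokens; the padding scheme at sentence boundaries is our assumption:

```python
PAD, N_WINDOW = "<pad>", 10  # n = 10 context words on each side of the trigger

def mention_tokens(tokens, trigger_idx, n=N_WINDOW):
    """Return the trigger plus its n-word context, padded to length 2n + 1."""
    left = tokens[max(0, trigger_idx - n):trigger_idx]
    right = tokens[trigger_idx + 1:trigger_idx + 1 + n]
    left = [PAD] * (n - len(left)) + left
    right = right + [PAD] * (n - len(right))
    return left + [tokens[trigger_idx]] + right
```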

Embedding layer: We represent each input token by the concatenation of the following components:

Word embedding: The word representation of the given token. We use pretrained word vectors to initialize the word embedding layer.

Character embedding: To identify (in)compatibilities regarding person, organization or location names, the handling of out-of-vocabulary (OOV) words is critical. Adding character-level embeddings can alleviate the OOV problem (Yang et al., 2017). Thus, we apply a convolutional neural network over the characters of each token to acquire the corresponding character embedding.

POS and NER one-hot vectors: One-hot vectors of the part-of-speech (POS) tag and the NER tag.

Exact match: A binary feature indicating whether a given token appears in the contexts of both event mentions. This feature has proved useful for several NLP tasks operating on pairs of texts (Chen et al., 2017; Gong et al., 2018; Pan et al., 2018).

Trigger position: We encode the position of the trigger word by adding a binary feature indicating whether a given token is a trigger word.


    Figure 3: Model structure.

Encoding layer: We pass the sequence of embedding vectors into a biLSTM layer (Hochreiter and Schmidhuber, 1997), resulting in a sequence of hidden vectors of size |h|:

h_a^i = biLSTM(emb(w_a^i), h_a^{i-1})
h_b^i = biLSTM(emb(w_b^i), h_b^{i-1})    (2)

where emb(w) is the embedding vector of token w.

Interaction layer: The interaction layer captures the relations between the two event mentions based on the hidden vectors h_a and h_b. The interaction tensor I, a 3-D tensor of shape (N, N, |h|), is calculated by taking the pairwise multiplication of the corresponding hidden vectors:

I_ij = h_a^i ◦ h_b^j    (3)

Finally, we apply a multi-layer convolutional neural network to extract the event pair representation vector f_ev.
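In PyTorch terms, the interaction tensor of Equation (3) is a single einsum over the two hidden-state sequences, followed by 2-D convolutions over the (N, N) grid; a minimal sketch, with layer sizes illustrative except for the kernel size of 3 and the max-pooling stated in section 4.1.2:

```python
import torch
import torch.nn as nn

N, H = 21, 400  # mention length and biLSTM output size (both illustrative)

h_a = torch.randn(1, N, H)  # hidden states of the antecedent mention
h_b = torch.randn(1, N, H)  # hidden states of the anaphoric mention

# Equation (3): I[i, j] = h_a[i] * h_b[j], element-wise, for every token pair.
I = torch.einsum("bih,bjh->bijh", h_a, h_b)  # shape (1, N, N, H)

# Multi-layer CNN over the (N, N) grid extracts the event pair representation.
cnn = nn.Sequential(
    nn.Conv2d(H, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)
f_ev = cnn(I.permute(0, 3, 1, 2)).flatten(1)  # event pair representation vector
```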

Inference layer: In the pretraining stage, we feed f_ev to a fully-connected inference layer to make a binary prediction of argument compatibility.

As for the fine-tuning stage, we concatenate an auxiliary feature vector f_aux to f_ev before feeding it into the inference layer. f_aux consists of two features: a one-hot vector that encodes the sentence distance between the two event mentions, and the difference of the word embedding vectors of the two triggers.

    4 Evaluation

    4.1 Experimental Setup

4.1.1 Corpora

We use English Gigaword (Parker et al., 2009) as the unlabeled corpus for argument compatibility learning. This corpus consists of news articles from five news sources, each article annotated with its creation date.

As for event coreference resolution, we use the English portion of the KBP 2015 and 2016 datasets (Ellis et al., 2015, 2016) for training, and the KBP 2017 dataset (Getman et al., 2017) for evaluation. The KBP datasets comprise news articles and discussion forum threads. The KBP 2015, 2016, and 2017 corpora contain 648, 169, and 167 documents, respectively. Each document is annotated with event mentions of 9 types and 18 subtypes, along with the coreference clusters of these event mentions.

4.1.2 Implementation Details

Preprocessing: We use the Stanford CoreNLP toolkit (Manning et al., 2014) to preprocess the input data.

Network structure: Each word embedding is initialized with the 300-dimensional pretrained GloVe embedding (Pennington et al., 2014). The character embedding layer is a combination of an 8-dimensional embedding layer and three 1D convolution layers with a kernel size of 5 and 100 filters. The size of the biLSTM layer is 200. The maximum length of a word is 16 characters; shorter words are padded with zeros and longer words are cropped. For the interaction layer, we use convolution layers with a kernel size of 3 in combination with max-pooling layers. The size of the inference layer is 128. Sigmoid activation is used for the inference layer, and all other layers use ReLU as the activation function.

                               MUC     B3      CEAFe   BLANC   AVG-F
biLSTM (standard)              29.49   43.15   39.91   24.15   34.18
biLSTM (transfer)              33.84   42.91   38.39   26.59   35.43
Interact (standard)            31.12   42.84   39.01   24.99   34.49
Interact (transfer)            34.28   42.93   39.95   32.12   36.24
Interact (transfer, 2nd iter)  35.66   43.20   40.02   32.43   36.75
Interact (transfer, 3rd iter)  36.05   43.07   39.69   28.06   36.72
Jiang et al. (2017)            30.63   43.84   39.86   26.97   35.33

Table 3: Event coreference resolution results of our proposed system, compared with the biLSTM baseline model and the current state-of-the-art system.

Event mention detection model: For word embeddings, we use the concatenation of a 300-dimensional pretrained GloVe embedding and the 50-dimensional embedding proposed by Turian et al. (2010). The character embedding layer is a combination of an 8-dimensional embedding layer and three 1D convolution layers with kernel sizes of 3, 4, and 5, each with 50 filters.

4.1.3 Evaluation Metrics

We follow the standard evaluation setup adopted in the official evaluation of the KBP event nugget detection and coreference task. This setup is based on four distinct scoring measures — MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998), CEAFe (Luo, 2005), and BLANC (Recasens and Hovy, 2011) — and the unweighted average of their F-scores (AVG-F). We use AVG-F as the main evaluation measure when comparing system performances.

    4.2 Results

We present the experimental results on the KBP 2017 corpus in Table 3. In the following, we compare the performance of methods with different network architectures and experimental settings.

Comparison of network architectures: We compare the results of the interactive inference network (Interact) with the biLSTM baseline model (biLSTM).

The biLSTM baseline model does not have the interaction layer. Instead, the last hidden vectors of the biLSTM layer are concatenated and fed directly into the inference layer.

When trained solely on the event coreference corpus (standard), the model with the interactive inference network performs slightly better than the biLSTM baseline model, as shown in rows 1 and 3. However, with an additional pretraining step of argument compatibility learning (transfer), the interactive inference network outperforms the biLSTM baseline model by a considerable margin, as shown in rows 2 and 4. We conclude that the interactive inference network can better capture the complex interactions between two event mentions, accounting for the difference in performance.

Effect of transfer learning: Regardless of the network structure, we observe a considerable improvement in performance by pretraining the model as an argument compatibility classifier. The biLSTM baseline model achieves an improvement of 1.25 points in AVG-F by doing transfer learning, as can be seen in rows 1 and 2. As for the interactive inference network, an improvement of 1.75 points in AVG-F is achieved, as can be seen in rows 3 and 4. These results provide suggestive evidence that our proposed transfer learning framework, which utilizes a large unlabeled corpus to perform argument compatibility learning, is effective.

Effect of iterative relabeling: We achieve another boost in performance by using the trained event coreference resolver to relabel the training samples for argument compatibility learning. The best result is achieved after two iterations (row 5), with an improvement of 2.26 points in AVG-F compared to the standard interactive inference network (row 3). However, we are not able to obtain further gains with more iterations of relabeling (row 6). We speculate that the difference in event coreference model predictions across different iterations is not big enough to have a perceivable impact, but additional experiments are needed to determine the reason.

Type      Event Mention Pair                                                                     Gold       System

Explicit  m1: ... the building where 13 people were killed will be razed, and a memorial ...
          m2: In that case, George Hennard killed 23 people at a Luby's restaurant, ...          non-coref  non-coref

Explicit  m1: Ten relatives of the victims arrived at the airport Sunday before traveling to
          the city of Jiangshan.
          m2: On Monday, the victims' relatives went to the Jiangshan Municipal Funeral Parlor.  non-coref  non-coref

Implicit  m1: ... a young woman protester was brutally slapped while she was demonstrating ...
          m2: ... explain why a women protester in her 60s was beaten up by policemen ...        non-coref  coref

Implicit  m1: She died from a brain hemorrhage on July 10, 2003, ...
          m2: ... has denied killing his second wife, whom he says died in a car accident.       non-coref  non-coref

Implicit  m1: Nationwide demonstrations held in France to protest gay marriage.
          m2: ... to protest against the country's plan to legalize same-sex marriage.           coref      coref

General   m1: ... Connecticut elementary school shooting has reignited the debate over gun
          control.
          m2: Gun supporters hold that people, not guns, are to blame for the shootings.         non-coref  coref

General   m1: Industrial accidents have injured and killed Foxconn workers, and the company
          also experienced ...
          m2: ... explosion in May 2011 at Foxconn's Chengdu factory killed three workers ...    non-coref  non-coref

Table 4: Examples of event pairs with related triggers. Trigger words are boldfaced, and words with (in)compatibility information are colored in blue.

Comparison with the state of the art: Comparing rows 5 and 7, we can see that our method outperforms the previous state-of-the-art model (Jiang et al., 2017) by 1.42 points in AVG-F.

    5 Discussion

In this section, we conduct a qualitative analysis of the outputs of our best-performing system (the Interact (transfer, 2nd iter) system in Table 3) on the event coreference dataset and on unseen event mention pairs extracted from the unlabeled corpus.

    5.1 Compatibility Classification

We focus on samples with related triggers that have either compatible or incompatible arguments (Table 4). These samples can be roughly classified into the following categories:

Explicit argument compatibility: The existence of identical/distinct time phrases, numbers, location names or person names in the context is the most explicit form of (in)compatibility.

For these event pairs, the existence of identical/distinct phrases with the same NER type is a direct clue toward deciding their coreferent status. Exploiting this property, we filter the set of compatible samples acquired from the unlabeled corpus in order to remove samples with explicit incompatibility.

Our model can recognize this type of (in)compatibility with relatively high accuracy. Both Explicit examples shown in Table 4 are predicted correctly.

Implicit argument compatibility: Event pairs with implicitly (in)compatible arguments require external knowledge to resolve.

We present three examples in Table 4. In the first example, the knowledge that a woman in her 60s is generally not referred to as young is required to determine the incompatibility. Similarly, the knowledge that both brain hemorrhage and car accident are causes of death is required to classify the second example correctly.

While the performance on samples with implicit (in)compatibility is not as good as that on samples with explicit (in)compatibility, our system is able to capture implicit (in)compatibility to some extent. We believe that this type of (in)compatibility is difficult to capture with argument features designed on top of the outputs of argument extractors and entity coreference resolvers, and that the ability to resolve implicit (in)compatibility contributes largely to our system's performance improvements.

General-specific incompatibilities: Event mentions describing general events pose special challenges to the task of event coreference resolution.

In Table 4, we present two typical examples of this category. In the first example, the second event mention does not refer to any specific shooting event in the real world, in contrast to the first event mention, which describes a specific school shooting event. The same holds for the second example, where the first event mention depicts a general event and the second depicts a specific one.

General event mentions typically have few or even no arguments and modifiers, making the identification of non-coreference relations very challenging. Since we cannot rely on argument compatibility, a deeper understanding of the semantics of the event mentions is needed. General event mentions account for a considerable fraction of our system's errors, since they are quite pervasive in both news articles and discussion forum threads.

      Event Mention Pair                                                                       Type      System

I     m1: What would have happened if Steve Jobs had never left Apple ...                      -         -
      ma2: ... in the state that is today if John hadn't left.                                 Explicit  non-coref
      mb2: ... in the state that is today if she hadn't left.                                  Implicit  non-coref
      mc2: ... in the state that is today if he hadn't left.                                   Implicit  coref

II    m1: Police arrest 6 men for gangraping housewife in northern India.                      -         -
      ma2: Indian police have arrested six men for allegedly gangraping a 29-year-old
      housewife ...                                                                            Explicit  coref
      mb2: Indian police have arrested six men for allegedly gangraping a woman ...            Implicit  coref
      mc2: Indian police have arrested six men for allegedly gangraping a medical student ...  Implicit  non-coref

III   m1: Nationwide demonstrations in France to protest gay marriage.                         -         -
      ma2: ... took to the streets across the country to protest against the country's plan
      to legalize same-sex marriage.                                                           Implicit  coref
      mb2: ... took to the streets across the country to protest against the contentious
      citizenship amendment bill.                                                              Implicit  non-coref

Table 5: Case study on manually-generated event mention pairs. Trigger words are boldfaced, and the target arguments are colored in blue.

5.2 Case Study

To better understand the behavior of our system, we perform a case study on manually-generated event pairs. Specifically, for a given pair of event mentions, we first alter only one of the arguments and keep the rest of the content fixed. We then observe the behavior of the system across different variations of the altered argument (Table 5).

Example I: In this example, we pick the AGENT-argument as the target and alter the AGENT-argument of the second event mention. The event pair (m1, ma2) is non-coreferent due to the explicit incompatibility between Steve Jobs and John, and the system's prediction is also non-coreferent. Further, we alter the target argument to the pronoun she (mb2), resulting in an implicit incompatibility in the AGENT argument, since Steve Jobs is generally not considered a feminine name. As expected, the system classifies the event pair (m1, mb2) as non-coreferent. Finally, when we alter the target argument to he (mc2), the system correctly classifies the resulting pair as coreferent.

Example II: In this example, we pick the PATIENT-argument as the target and alter the PATIENT-argument of the second event mention. The system classifies the event pair (m1, ma2) as coreferent, which is reasonable considering the presence of the explicitly compatible arguments housewife and 29-year-old housewife. Further, when we alter the target argument to woman (mb2), the system output is still coreferent. This is consistent with our expectation: the event mentions are likely to be coreferent judging only from the contexts of the two event mentions. However, when we alter the target argument to medical student (mc2), the event pair becomes non-coreferent due to the incompatibility between medical student and housewife. The system classifies this event pair correctly.

Example III: In this example, we pick the REASON-argument as the target and alter the REASON-argument of the second event mention. The event pair (m1, ma2) has a pair of implicitly compatible arguments in the REASON-argument role and is likely to be coreferent. In contrast, altering the target argument to contentious citizenship amendment bill (mb2) yields a pair of implicitly incompatible arguments, and the resulting event pair becomes non-coreferent. Our system classifies both event pairs correctly.

    6 Conclusion

We proposed an iterative transfer learning framework for event coreference resolution. Our method exploits a large unlabeled corpus to learn a wide range of (in)compatibilities between arguments, which contributes to the improvement in performance on the event coreference resolution task. We achieved state-of-the-art results on the KBP 2017 English event coreference dataset, outperforming the previous state-of-the-art system. In addition, a qualitative analysis of the system output confirmed the ability of our system to capture (in)compatibilities between two event mentions.

    Acknowledgments

We thank the three anonymous reviewers for their detailed comments on an earlier draft of the paper. This work was supported in part by NSF Grants IIS-1219142 and IIS-1528037, and JST CREST Grant Number JPMJCR1301, Japan.


References

David Ahn. 2006. The stages of event extraction. In Proceedings of the COLING/ACL Workshop on Annotating and Reasoning about Time and Events, pages 1–8.

Jun Araki, Zhengzhong Liu, Eduard Hovy, and Teruko Mitamura. 2014. Detecting subevent structure for event coreference resolution. In Proceedings of the 9th International Conference on Language Resources and Evaluation, pages 4553–4558.

Jun Araki and Teruko Mitamura. 2015. Joint event trigger identification and event coreference resolution with structured perceptron. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2074–2080.

Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In Proceedings of the LREC Workshop on Linguistics Coreference, pages 563–566.

Cosmin Adrian Bejan and Sanda Harabagiu. 2014. Unsupervised event coreference resolution. Computational Linguistics, 40(2):311–347.

Chen Chen and Vincent Ng. 2013. Chinese event coreference resolution: Understanding the state of the art. In Proceedings of the 6th International Joint Conference on Natural Language Processing, pages 822–828.

Chen Chen and Vincent Ng. 2014. SinoCoreferencer: An end-to-end Chinese event coreference resolver. In Proceedings of the 9th International Conference on Language Resources and Evaluation, pages 4532–4538.

Chen Chen and Vincent Ng. 2015. Chinese event coreference resolution: An unsupervised probabilistic model rivaling supervised resolvers. In Proceedings of Human Language Technologies: The 2015 Conference of the North American Chapter of the Association for Computational Linguistics, pages 1097–1107.

Chen Chen and Vincent Ng. 2016. Joint inference over a lightly supervised information extraction pipeline: Towards event coreference resolution for resource-scarce languages. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 2913–2920.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879.

Zheng Chen and Heng Ji. 2009. Graph-based event coreference resolution. In Proceedings of the ACL-IJCNLP Workshop on Graph-based Methods for Natural Language Processing (TextGraphs-4), pages 54–57.

Zheng Chen, Heng Ji, and Robert Haralick. 2009. A pairwise event coreference model, feature impact and evaluation for event coreference resolution. In Proceedings of the RANLP Workshop on Events in Emerging Text Types, pages 17–22.

Prafulla Kumar Choubey and Ruihong Huang. 2017. Event coreference resolution by iteratively unfolding inter-dependencies among events. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2124–2133.

Agata Cybulska and Piek Vossen. 2013. Semantic relations between events and their time, locations and participants for event coreference resolution. In Proceedings of the 9th International Conference on Recent Advances in Natural Language Processing, pages 156–163.

Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, Zhiyi Song, Ann Bies, and Stephanie Strassel. 2015. Overview of linguistic resources for the TAC KBP 2015 evaluations: Methodologies and results. In Proceedings of the TAC KBP 2015 Workshop.

Joe Ellis, Jeremy Getman, Neil Kuster, Zhiyi Song, Ann Bies, and Stephanie Strassel. 2016. Overview of linguistic resources for the TAC KBP 2016 evaluations: Methodologies and results. In Proceedings of the TAC KBP 2016 Workshop.

Jeremy Getman, Joe Ellis, Zhiyi Song, Jennifer Tracey, and Stephanie Strassel. 2017. Overview of linguistic resources for the TAC KBP 2017 evaluations: Methodologies and results. In Proceedings of the TAC KBP 2017 Workshop.

Yichen Gong, Heng Luo, and Jian Zhang. 2018. Natural language inference over interaction space. In Proceedings of the 6th International Conference on Learning Representations.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Shanshan Jiang, Yihan Li, Tianyi Qin, Qian Meng, and Bin Dong. 2017. SRCB entity discovery and linking (EDL) and event nugget systems for TAC 2017. In Proceedings of the TAC KBP 2017 Workshop.

Sebastian Krause, Feiyu Xu, Hans Uszkoreit, and Dirk Weissenborn. 2016. Event linking with sentential features from convolutional neural networks. In Proceedings of the 20th Conference on Computational Natural Language Learning, pages 239–249.

Heeyoung Lee, Marta Recasens, Angel Chang, Mihai Surdeanu, and Dan Jurafsky. 2012. Joint entity and event coreference resolution across documents. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 489–500.

Zhengzhong Liu, Jun Araki, Eduard Hovy, and Teruko Mitamura. 2014. Supervised within-document event coreference using information propagation. In Proceedings of the 9th International Conference on Language Resources and Evaluation, pages 4539–4544.

Jing Lu and Vincent Ng. 2016. Event coreference resolution with multi-pass sieves. In Proceedings of the 10th International Conference on Language Resources and Evaluation, pages 3996–4003.

Jing Lu and Vincent Ng. 2017. Joint learning for event coreference resolution. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 90–101.

Jing Lu and Vincent Ng. 2018. Event coreference resolution: A survey of two decades of research. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 5479–5486.

Jing Lu, Deepak Venugopal, Vibhav Gogate, and Vincent Ng. 2016. Joint inference for event coreference resolution. In Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers, pages 3264–3275.

Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 25–32.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Katie McConky, Rakesh Nagi, Moises Sudit, and William Hughes. 2012. Improving event co-reference by context extraction and dynamic feature weighting. In Proceedings of the 2012 IEEE International Multi-Disciplinary Conference on Cognitive Methods in Situation Awareness and Decision Support, pages 38–43.

Vincent Ng and Claire Cardie. 2002. Improving machine learning approaches to coreference resolution. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 104–111.

Boyuan Pan, Yazheng Yang, Zhou Zhao, Yueting Zhuang, Deng Cai, and Xiaofei He. 2018. Discourse marker augmented network with reinforcement learning for natural language inference. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 989–999.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2009. English Gigaword fourth edition. Linguistic Data Consortium.

Haoruo Peng, Yangqiu Song, and Dan Roth. 2016. Event detection and co-reference with minimal supervision. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 392–402.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.

Marta Recasens and Eduard Hovy. 2011. BLANC: Implementing the Rand index for coreference evaluation. Natural Language Engineering, 17(4):485–510.

Satyan Sangeetha and Michael Arock. 2012. Event coreference resolution using mincut based graph clustering. In Proceedings of the 4th International Workshop on Computer Networks & Communications, pages 253–260.

Wee Meng Soon, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394.

Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In Proceedings of the Sixth Message Understanding Conference, pages 45–52.

Bishan Yang, Claire Cardie, and Peter Frazier. 2015. A hierarchical distance-dependent Bayesian model for event coreference resolution. Transactions of the Association for Computational Linguistics, 3:517–528.

Zhilin Yang, Bhuwan Dhingra, Ye Yuan, Junjie Hu, William W. Cohen, and Ruslan Salakhutdinov. 2017. Words or characters? Fine-grained gating for reading comprehension. In Proceedings of the 5th International Conference on Learning Representations.