
CorefQA: Coreference Resolution as Query-based Span Prediction

Wei Wu♣, Fei Wang♣, Arianna Yuan♦♣, Fei Wu♠ and Jiwei Li♠♣
♠ Department of Computer Science and Technology, Zhejiang University
♦ Computer Science Department, Stanford University
♣ ShannonAI

[email protected], [email protected], {wei wu, fei wang, jiwei li}@shannonai.com

Abstract

In this paper, we present CorefQA, an accurate and extensible approach for the coreference resolution task. We formulate the problem as a span prediction task, like in question answering: a query is generated for each candidate mention using its surrounding context, and a span prediction module is employed to extract the text spans of the coreferences within the document using the generated query. This formulation comes with the following key advantages: (1) the span prediction strategy provides the flexibility of retrieving mentions left out at the mention proposal stage; (2) in the question answering framework, encoding the mention and its context explicitly in a query makes it possible to have a deep and thorough examination of cues embedded in the context of coreferent mentions; and (3) a plethora of existing question answering datasets can be used for data augmentation to improve the model's generalization capability. Experiments demonstrate a significant performance boost over previous models, with an F1 score of 83.1 (+3.5) on the CoNLL-2012 benchmark and 87.5 (+2.5) on the GAP benchmark.1

1 Introduction

Recent coreference resolution systems (Lee et al., 2017, 2018; Zhang et al., 2018a; Kantor and Globerson, 2019) consider all text spans in a document as potential mentions and learn to find an antecedent for each possible mention. There are two key issues with this paradigm, in terms of task formalization and the algorithm.

At the task formalization level, mentions left out at the mention proposal stage can never be recovered since the downstream module only operates on the proposed mentions. Existing models often suffer from errors at the mention proposal stage (Zhang et al.,

1 https://github.com/ShannonAI/CorefQA

Original Passage
In addition, many people were poisoned when toxic gas was released. They were poisoned and did not know how to protect themselves against the poison.

Converted Questions
Q1: Who were poisoned when toxic gas was released?
A1: [They, themselves]
Q2: What was released when many people were poisoned?
A2: [the poison]
Q3: Who were poisoned and did not know how to protect themselves against the poison?
A3: [many people, themselves]
Q4: Whom did they not know how to protect against the poison?
A4: [many people, They]
Q5: They were poisoned and did not know how to protect themselves against what?
A5: [toxic gas]

Figure 1: An illustration of the paradigm shift from coreference resolution to query-based span prediction. Spans with the same color represent coreferent mentions. Note that we use a more direct strategy to generate the questions based on the mentions.

2018a). The coreference datasets can only provide a weak signal for spans that correspond to entity mentions because singleton mentions are not explicitly labeled. Due to the inferiority of the mention proposal model, it would be favorable if a coreference framework had a mechanism to retrieve left-out mentions.

At the algorithm level, existing end-to-end methods (Lee et al., 2017, 2018; Zhang et al., 2018a) score each pair of mentions only based on mention representations from the output layer of a contextualization model.


This means that the model lacks the connection between mentions and their contexts. Semantic matching operations between two mentions (and their contexts) are performed only at the output layer and are relatively superficial. Therefore, it is hard for such models to capture all the lexical, semantic and syntactic cues in the context.

To alleviate these issues, we propose CorefQA, a new approach that formulates the coreference resolution problem as a span prediction task, akin to the question answering setting. A query is generated for each candidate mention using its surrounding context, and a span prediction module is further employed to extract the text spans of the coreferences within the document using the generated query. Some concrete examples are shown in Figure 1.2

This formulation provides benefits at both the task formulation level and the algorithm level. At the task formulation level, since left-out mentions can still be retrieved at the span prediction stage, the negative effect of undetected mentions is significantly alleviated. At the algorithm level, by generating a query for each candidate mention using its surrounding context, the CorefQA model explicitly considers the surrounding context of the target mentions, the influence of which will later be propagated to each input word using the self-attention mechanism. Additionally, unlike existing end-to-end methods (Lee et al., 2017, 2018; Zhang et al., 2018a), where the interactions between two mentions are only superficially modeled at the output layer of contextualization, span prediction requires a more thorough and deeper examination of the lexical, semantic and syntactic cues within the context, which will potentially lead to better performance.

Moreover, the proposed question answering formulation allows us to take advantage of existing question answering datasets. Coreference annotation is expensive, cumbersome and often requires linguistic expertise from annotators. Under the proposed formulation, coreference resolution has the same format as the existing question answering datasets (Rajpurkar et al., 2016a, 2018; Dasigi et al., 2019a). Those datasets can thus readily be used for data augmentation.

2 This is an illustration of the question formulation. The actual operation is described in Section 3.4.

We show that pre-training on existing question answering datasets improves the model's generalization and transferability, leading to an additional performance boost.

Experiments show that the proposed framework significantly outperforms previous models on two widely-used datasets. Specifically, we achieve new state-of-the-art scores of 83.1 (+3.5) on the CoNLL-2012 benchmark and 87.5 (+2.5) on the GAP benchmark.

2 Related Work

2.1 Coreference Resolution

Coreference resolution is a fundamental problem in natural language processing and is considered a good test of machine intelligence (Morgenstern et al., 2016). Neural network models have shown promising results over the years. Earlier neural-based models (Wiseman et al., 2016; Clark and Manning, 2015, 2016) rely on parsers and hand-engineered mention proposal algorithms. Recent work (Lee et al., 2017, 2018; Kantor and Globerson, 2019) tackled the problem in an end-to-end fashion by jointly detecting mentions and predicting coreferences. Based on how entity-level information is incorporated, these models can be further categorized as (1) entity-level models (Bjorkelund and Kuhn, 2014; Clark and Manning, 2015, 2016; Wiseman et al., 2016) that directly model the representation of real-world entities and (2) mention-ranking models (Durrett and Klein, 2013; Wiseman et al., 2015; Lee et al., 2017) that learn to select the antecedent of each anaphoric mention. Our CorefQA model is essentially a mention-ranking model, but we identify coreference using question answering.

2.2 Formalizing NLP Tasks as Question Answering

Machine reading comprehension is a general and extensible task form. Many tasks in natural language processing can be framed as reading comprehension while abstracting away the task-specific modeling constraints.

McCann et al. (2018) introduced the decaNLP challenge, which converts a set of 10 core tasks in NLP to reading comprehension. He et al. (2015) showed that semantic role labeling annotations could be solicited by using question-answer pairs to represent the predicate-argument structure. Levy et al. (2017) reduced relation extraction to answering simple reading comprehension questions, yielding models that generalize better in the zero-shot setting.


[Figure 2 (caption below): the Input Passage is fed to the Mention Proposal Module, which proposes candidate mentions such as "I", "my cat" and "the song"; the Mention Linking Module then answers queries such as "<mention> I </mention> was hired to do some Christmas music", "And I brought <mention> my cat </mention> with me to the studio" and "And I was working on <mention> the song </mention>" against the passage, yielding coreference clusters such as [I, I, my, me, I, me], [my cat, the cat] and ["Jingle Bells", the song].]

Figure 2: The overall architecture of our CorefQA model. The input passage is first fed into the Mention Proposal Module (Section 3.3) to obtain candidate mentions. Then the Mention Linking Module (Section 3.4) is used to extract coreferent mentions from the passage for each proposed mention. The coreference clusters are obtained using the scores produced in the above two stages.

Li et al. (2019a,b) cast the tasks of named entity extraction and relation extraction as a reading comprehension problem. In parallel to our work, Aralikatte et al. (2019) converted coreference and ellipsis resolution into a question answering format, and showed the benefits of training joint models for these tasks. Their models are built under the assumption that gold mentions are provided at inference time, whereas our model does not need that assumption: it jointly trains the mention proposal model and the coreference resolution model in an end-to-end manner. Chada (2019) proposed an extractive QA model for the resolution of ambiguous pronouns, and showed better results on the GAP dataset by only fine-tuning the pre-trained BERT model.

2.3 Data Augmentation

Data augmentation is a strategy that enables practitioners to significantly increase the diversity of data available for training models. Data augmentation techniques have been explored in various fields such as question answering (Talmor and Berant, 2019), text classification (Kobayashi, 2018) and dialogue language understanding (Hou et al., 2018). In coreference resolution, Zhao et al.

(2018); Emami et al. (2019); Zhao et al. (2019) focused on debiasing the gender bias problem; Aralikatte et al. (2019) explored the effectiveness of joint modeling of ellipsis and coreference resolution. To the best of our knowledge, we are the first to use existing question answering datasets as data augmentation for coreference resolution.

3 Model

In this section, we describe our CorefQA model in detail. The overall architecture is illustrated in Figure 2.

3.1 Notations

Given a sequence of input tokens X = {x1, x2, ..., xn} in a document, where n denotes the length of the document, N = n(n + 1)/2 denotes the number of all possible text spans in X. Let ei denote the i-th span representation (1 ≤ i ≤ N), with start index FIRST(i) and end index LAST(i), i.e., ei = {xFIRST(i), xFIRST(i)+1, ..., xLAST(i)−1, xLAST(i)}. The task of coreference resolution is to determine the antecedents for all possible spans. If a candidate span ei does not represent an entity mention or is not coreferent with any other mentions, a dummy token ε is assigned as its antecedent.


The linking between all possible spans e defines the final clustering.
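To make the notation concrete, here is a minimal Python sketch (our own illustration, not the released implementation; all names are hypothetical) that enumerates the N = n(n + 1)/2 candidate spans of a document and represents the dummy antecedent ε:

```python
def enumerate_spans(tokens, max_len=None):
    """Enumerate all (FIRST, LAST) index pairs of contiguous spans.

    With no length cap this yields n * (n + 1) / 2 spans for a document
    of n tokens, matching the definition of N above.
    """
    n = len(tokens)
    spans = []
    for first in range(n):
        last_limit = n if max_len is None else min(n, first + max_len)
        for last in range(first, last_limit):
            spans.append((first, last))
    return spans

DUMMY_ANTECEDENT = None  # stands in for the dummy token epsilon

tokens = "many people were poisoned when toxic gas was released".split()
spans = enumerate_spans(tokens)
assert len(spans) == len(tokens) * (len(tokens) + 1) // 2
print(len(spans), spans[:3])  # 45 [(0, 0), (0, 1), (0, 2)]
```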

3.2 Input Representations

We use the SpanBERT model3 to obtain input representations, following Joshi et al. (2019a). Each token xi is associated with a SpanBERT representation xi. Since speaker information is indispensable for coreference resolution, previous methods (Wiseman et al., 2016; Lee et al., 2017; Joshi et al., 2019a) usually convert the speaker information into binary features indicating whether two mentions are from the same speaker. However, we use a straightforward strategy that directly concatenates the speaker's name with the corresponding utterance. This strategy is inspired by recent research in personalized dialogue modeling that uses persona information to represent speakers (Li et al., 2016; Zhang et al., 2018b; Mazare et al., 2018). In Section 5.2, we empirically demonstrate its superiority over the feature-based method in Lee et al. (2017).
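As a rough illustration of the speaker-as-input strategy (our own sketch; the <speaker> special tokens are the ones described in Section 4.1), each utterance is simply prefixed with its speaker's name so that the encoder reads speaker identity as ordinary context rather than as a binary same-speaker feature:

```python
def flatten_dialogue(turns):
    """Concatenate speaker names with their utterances.

    `turns` is a list of (speaker, utterance) pairs; the speaker name is
    wrapped in <speaker> ... </speaker> tokens and prepended to the turn.
    """
    pieces = [f"<speaker> {speaker} </speaker> {utterance}"
              for speaker, utterance in turns]
    return " ".join(pieces)

turns = [
    ("Paula Zahn", "Thelma Gutierrez went inside the forensic laboratory."),
    ("Thelma Gutierrez", "In this laboratory alone I'm surrounded by the remains."),
]
print(flatten_dialogue(turns))
```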

To fit long documents into SpanBERT, we use a sliding-window approach that creates a T-sized segment after every T/2 tokens. Segments are then passed to the SpanBERT encoder independently. The final token representations are derived by taking the token representations with maximum context.
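A minimal sketch of the sliding-window segmentation (our own simplification that ignores sub-word tokenization and special tokens): segments of length T start every T/2 tokens, and each token's final representation is taken from the segment in which it is closest to the center, i.e. has the most context on both sides.

```python
def sliding_segments(tokens, T=512):
    """Split a long document into overlapping segments of length up to T,
    one segment starting every T // 2 tokens."""
    stride = T // 2
    return [(start, tokens[start:start + T])
            for start in range(0, max(len(tokens) - stride, 1), stride)]

def best_segment_for_token(pos, segments):
    """Pick the covering segment where token `pos` is closest to the center,
    i.e. has maximum context on both sides."""
    covering = [(start, seg) for start, seg in segments
                if start <= pos < start + len(seg)]
    return max(covering, key=lambda s: -abs(pos - (s[0] + len(s[1]) / 2)))

doc = [f"tok{i}" for i in range(1200)]
segments = sliding_segments(doc, T=512)
print(len(segments), best_segment_for_token(700, segments)[0])  # 4 512
```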

3.3 Mention Proposal

Similar to Lee et al. (2017), our model considers all spans up to a maximum length L as potential mentions. To improve computational efficiency, we further prune the candidate spans greedily during both training and evaluation. To do so, the mention score of each candidate span consists of three parts: (1) whether xFIRST(i) is the start of a span; (2) whether xLAST(i) is the end of a span; and (3) whether xFIRST(i) and xLAST(i) form a valid span. The third part (i.e., (3)) is necessary because each sentence can contain multiple spans. The first part is computed by feeding xFIRST(i) into a feed-forward layer:

sm(xFIRST(i)) = FFNN([xFIRST(i)])    (1)

Similarly, the second part is computed by feeding xLAST(i) into a feed-forward layer:

sm(xLAST(i)) = FFNN([xLAST(i)])    (2)

3 https://github.com/facebookresearch/SpanBERT

The third part is computed by feeding the concatenation of xFIRST(i) and xLAST(i) into a feed-forward layer:

sm(xFIRST(i), xLAST(i)) = FFNNm([xFIRST(i), xLAST(i)])    (3)

FFNN() denotes the feed-forward neural network that computes a nonlinear mapping from the input vector to the mention score. The three involved FFNN() use separate sets of parameters. The overall score for span i being a mention is the average of the three parts:

sm(i) = [sm(xFIRST(i)) + sm(xLAST(i)) + sm(xFIRST(i), xLAST(i))]/3    (4)

We only keep up to λn (where n is the document length) spans with the highest mention scores.
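The scoring and pruning of Eqs. 1-4 can be summarized with the toy sketch below (our own paraphrase; random linear maps stand in for the three feed-forward networks and for SpanBERT token representations):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # toy hidden size
W_start, W_end = rng.normal(size=d), rng.normal(size=d)
W_pair = rng.normal(size=2 * d)          # stand-ins for the three FFNNs

def mention_score(x, first, last):
    """Average of the start, end and pair scores, as in Eq. 4."""
    s_start = W_start @ x[first]
    s_end = W_end @ x[last]
    s_pair = W_pair @ np.concatenate([x[first], x[last]])
    return (s_start + s_end + s_pair) / 3.0

def propose_mentions(x, max_len=10, keep_ratio=0.2):
    """Score every span up to max_len tokens and keep the top lambda * n."""
    n = len(x)
    spans = [(i, j) for i in range(n) for j in range(i, min(n, i + max_len))]
    spans.sort(key=lambda s: mention_score(x, *s), reverse=True)
    return spans[: int(keep_ratio * n)]

x = rng.normal(size=(30, d))             # toy token representations
print(propose_mentions(x))               # top 0.2 * 30 = 6 candidate spans
```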

Mention Proposal Pretraining It is crucial that the mention proposal model is pretrained. Otherwise, most of the proposed mentions that are fed to the linking stage are invalid mentions. The mention proposal model is pretrained by jointly training three binary classification models: (1) whether xFIRST(i) is the start of a span; (2) whether xLAST(i) is the end of a span; and (3) whether xFIRST(i) and xLAST(i) should be combined. This leads to the following objective for the mention proposal model:

Loss(m) = sigmoid(sm(xFIRST(i))) + sigmoid(sm(xLAST(i))) + sigmoid(sm(xFIRST(i), xLAST(i)))    (5)

3.4 Mention Linking as Span Prediction

Given a mention ei proposed by the mention proposal network, the role of the mention linking network is to give a score sa(i, j) for any text span ej, indicating whether ei and ej are coreferent. We propose to use the question answering framework as the backbone to compute sa(i, j). It operates on the triplet {context (X), query (q), answers (a)}. The context X is the input document. The query q(ei) is constructed as follows: given ei, we use the sentence that ei resides in as the query, with the minor modification that we encapsulate ei with the special tokens <mention> </mention>. The answers a are the coreferent mentions of ei. A query i is considered unanswerable in the following scenarios: (1) the candidate span ei does not represent an entity mention or (2) the candidate span ei represents an entity mention but is not coreferent with any other mentions in X.
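The query construction can be sketched as follows (our own toy illustration; sentence boundaries and tokenization are assumed to be given):

```python
def build_query(tokens, sentence_bounds, mention):
    """Return the sentence containing `mention` = (first, last), with the
    mention wrapped in <mention> ... </mention> special tokens."""
    first, last = mention
    sent_start, sent_end = next(
        (s, e) for s, e in sentence_bounds if s <= first and last < e)
    query_tokens = (tokens[sent_start:first]
                    + ["<mention>"] + tokens[first:last + 1] + ["</mention>"]
                    + tokens[last + 1:sent_end])
    return " ".join(query_tokens)

tokens = ("I was hired to do some Christmas music and I brought "
          "my cat with me to the studio").split()
sentence_bounds = [(0, len(tokens))]      # a single sentence in this toy case
print(build_query(tokens, sentence_bounds, (11, 12)))  # wraps "my cat"
```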


Following Devlin et al. (2019), we represent the input query and the context as a single packed sequence. Then, for any span j = [FIRST(j), ..., LAST(j)], we first compute the score of j being the answer for query q(ei), denoted by sa(j|i). Let xFIRST(j)|i and xLAST(j)|i respectively denote the representations for FIRST(j) and LAST(j) from BERT, where q(ei) is used as the query concatenated to the context. sa(j|i) is computed by feeding the first and the last of its constituent token representations (i.e., xFIRST(j)|i and xLAST(j)|i) into a feed-forward layer:

sa(j|i) = FFNNj|i([xFIRST(j)|i, xLAST(j)|i])    (6)

FFNNj|i denotes the feed-forward neural network that computes a nonlinear mapping from the input vector to the mention score. Comparing Eq. 6 with Eq. 3, we can observe their relatedness and difference: both equations compute scores for a span, but in Eq. 6 the query q(ei) is additionally used to check whether span j is the answer for q(ei).

A closer look at Eq. 6 reveals that it only models the uni-directional coreference relation from ei to ej, i.e., ej is the answer for query q(ei). This is suboptimal since if ei is a coreferent mention of ej, then ej should also be a coreferent mention of ei. We thus need to optimize the bi-directional relation between ei and ej.4 The final score sa(i, j) is thus given as follows:

sa(i, j) = (sa(j|i) + sa(i|j)) / 2    (7)

sa(i|j) can be computed in the same way as sa(j|i), in which q(ej) is used as the query:

sa(i|j) = FFNNi|j([xFIRST(i)|j, xLAST(i)|j])    (8)

where xFIRST(i)|j and xLAST(i)|j respectively denote the representations for FIRST(i) and LAST(i) from BERT, where q(ej) is used as the query concatenated to the context.

For a pair of text spans ei and ej, the premises for them being coreferent mentions are that (1) they are both mentions and (2) they are coreferent. This makes the overall score s(i, j) for ei and ej the combination of Eq. 4 and Eq. 7:

s(i, j) = λ[sm(i) + sm(j)] + (1 − λ)sa(i, j)    (9)

4 This bidirectional relationship is actually referred to as mutual dependency and has been shown to benefit a wide range of NLP tasks such as machine translation (Hassan et al., 2018) or dialogue generation (Li et al., 2015).

λ is the hyperparameter to control the tradeoff between mention proposal and mention linking.
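Putting the pieces together, a toy version of the overall pair score (our own sketch; the placeholder scorers stand in for the trained mention-proposal and span-prediction networks) looks like this:

```python
def overall_score(s_m, s_a, i, j, lam=0.5):
    """Eq. 9: combine the mention scores of Eq. 4 with the symmetric
    span-prediction score of Eq. 7.

    `s_m(span)` returns a mention score; `s_a(query_span, answer_span)`
    returns the directional score of Eq. 6 / Eq. 8.
    """
    s_bidirectional = 0.5 * (s_a(i, j) + s_a(j, i))        # Eq. 7
    return lam * (s_m(i) + s_m(j)) + (1.0 - lam) * s_bidirectional

# toy scorers standing in for the trained networks
toy_sm = lambda span: float(span[1] - span[0])   # longer spans score higher
toy_sa = lambda q, a: 1.0 if q != a else 0.0
print(overall_score(toy_sm, toy_sa, (0, 1), (5, 6)))
```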

3.5 Antecedent Pruning

Given a document X with length n and O(n2) possible spans, the computation of Eq. 9 for all mention pairs is intractable, with a complexity of O(n4). Given an extracted mention ei, the computation of Eq. 9 for (ei, ej) over all ej is still extremely intensive, since the computation of the backward span prediction score sa(i|j) requires running the question answering model on every query q(ej). A further pruning procedure is thus needed: for each query q(ei), we collect C span candidates based only on the sa(j|i) scores, and then use Eq. 9 to compute the overall scores.
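A rough sketch of this two-step pruning (our own illustration; `forward_score` stands in for running the span prediction model with query q(ei)):

```python
def prune_antecedents(mention, candidate_spans, forward_score, C=50):
    """Keep only the C spans with the highest forward scores s_a(j | i), so
    that the expensive backward scores s_a(i | j) and the overall scores of
    Eq. 9 are computed for at most C candidates per proposed mention."""
    ranked = sorted(candidate_spans,
                    key=lambda j: forward_score(mention, j),
                    reverse=True)
    return ranked[:C]

toy_score = lambda i, j: -abs(i[0] - j[0])     # toy scorer: prefer nearby spans
candidates = [(k, k + 1) for k in range(200)]
print(prune_antecedents((100, 101), candidates, toy_score, C=5))
```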

3.6 Training

Each mention ei proposed by the mention proposal network is associated with C potential spans proposed by the mention linking network based on sa(j|i). We aim to optimize the marginal log-likelihood of all correct antecedents implied by the gold clustering. Following Lee et al. (2017), we append a dummy token ε to the C candidates; the model outputs it if none of the C span candidates is coreferent with ei. For each mention ei, the model learns a distribution P(·) over all possible antecedent spans ej based on the global score s(i, j) from Eq. 9:

P(ej) = exp(s(i, j)) / Σj′∈C exp(s(i, j′))    (10)

The mention proposal module and the mention linking module are jointly trained in an end-to-end fashion using training signals from Eq. 10, with the SpanBERT parameters shared.
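The training objective can be sketched as follows (our own toy NumPy version): Eq. 10 is a softmax over the C candidates plus the dummy, and the loss marginalizes over all candidates that are gold antecedents.

```python
import numpy as np

def antecedent_loss(scores, gold_mask):
    """Negative marginal log-likelihood of all gold antecedents.

    `scores`: s(i, j) for the C candidate spans plus the dummy (last entry).
    `gold_mask`: boolean array marking the gold antecedents among them; if
    the mention has no antecedent, only the dummy position is marked.
    """
    log_probs = scores - np.logaddexp.reduce(scores)    # log softmax, Eq. 10
    return -np.logaddexp.reduce(log_probs[gold_mask])   # marginalize over golds

scores = np.array([2.0, 0.5, -1.0, 0.0])     # 3 candidates + dummy
gold = np.array([True, False, True, False])  # two gold antecedents
print(antecedent_loss(scores, gold))
```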

3.7 Inference

Given an input document, we can obtain an undirected graph using the overall score, each node of which represents a candidate mention from either the mention proposal module or the mention linking module. We prune the graph by keeping, for each node, only the edge with the largest weight based on Eq. 10. Nodes whose closest neighbor is the dummy token ε are abandoned. The mention clusters can then be decoded from the graph.
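A sketch of this decoding step (our own simplification using union-find): each mention keeps only its highest-scoring link, links to the dummy are discarded, and the clusters are the connected components of the remaining graph.

```python
from collections import defaultdict

def decode_clusters(mentions, best_link):
    """`best_link[m]` is the highest-scoring antecedent of mention m, or
    None if the dummy token scored highest. Clusters are the connected
    components of the resulting undirected graph."""
    parent = {m: m for m in mentions}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]        # path compression
            m = parent[m]
        return m

    for m, antecedent in best_link.items():
        if antecedent is not None:
            parent[find(m)] = find(antecedent)   # union the two components

    clusters = defaultdict(list)
    for m in mentions:
        clusters[find(m)].append(m)
    return [c for c in clusters.values() if len(c) > 1]

mentions = ["many people", "They", "themselves", "toxic gas", "the poison"]
best_link = {"many people": None, "They": "many people",
             "themselves": "They", "toxic gas": None, "the poison": "toxic gas"}
print(decode_clusters(mentions, best_link))
```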


3.8 Data Augmentation using Question Answering Datasets

We hypothesize that the reasoning skills (such as synonymy, world knowledge, syntactic variation, and multiple-sentence reasoning) required to answer the questions are also indispensable for coreference resolution. Annotated question answering datasets are usually significantly larger than coreference datasets due to the high linguistic expertise required for the latter. Under the proposed QA formulation, coreference resolution has the same format as the existing question answering datasets (Rajpurkar et al., 2016a, 2018; Dasigi et al., 2019a). In this way, they can readily be used for data augmentation. We thus propose to pre-train the mention linking network on the Quoref dataset (Dasigi et al., 2019b) and the SQuAD dataset (Rajpurkar et al., 2016b).
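As a rough illustration of why the formats line up (our own sketch with made-up field names), both a SQuAD/Quoref example and a mention-linking instance reduce to the same (context, query, answer spans) triple:

```python
def to_span_prediction_example(context, query, answer_texts):
    """Common training format shared by QA data and coreference queries:
    a context, a query, and the character spans of the answers."""
    spans = [(context.index(a), context.index(a) + len(a)) for a in answer_texts]
    return {"context": context, "query": query, "answers": spans}

# A reading-comprehension example, usable for pre-training the linking network.
qa_example = to_span_prediction_example(
    "Toxic gas was released and many people were poisoned.",
    "Who were poisoned when toxic gas was released?",
    ["many people"],
)

# A coreference query built from a proposed mention, in the same format.
coref_example = to_span_prediction_example(
    "Many people were poisoned. They did not know how to protect themselves.",
    "<mention> They </mention> did not know how to protect themselves.",
    ["Many people", "themselves"],
)
print(qa_example["answers"], coref_example["answers"])
```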

3.9 Summary and Discussion

Compared with existing models (Lee et al., 2017, 2018; Joshi et al., 2019b), the proposed question answering formalization has the flexibility of retrieving mentions left out at the mention proposal stage. However, since we still rely on a mention proposal model, we need to know in which situations missed mentions can be retrieved and in which they cannot. We use the example in Figure 1 as an illustration, in which {many people, They, themselves} are coreferent mentions: if some of the mentions are missed by the mention proposal model, e.g., many people and They, they can still be retrieved at the mention linking stage when a mention that was not missed (i.e., themselves) is used as the query. But if all the mentions within the cluster are missed, none of them can be used for query construction, which means they will all be irreversibly left out. Given that the mention proposal network proposes a significant number of mentions, the chance that all mentions within a mention cluster are missed is relatively low (and decreases exponentially as the number of entities increases). This explains the superiority (though far from perfect) of the proposed model. However, how to completely remove the mention proposal network remains an open problem in the field of coreference resolution.

4 Experiments

4.1 Implementation Details

The special tokens used to denote the speaker's name (<speaker> </speaker>) and the special tokens used to denote the queried mentions (<mention> </mention>) are initialized by randomly taking unused tokens from the SpanBERT vocabulary. The sliding window size T = 512, and the mention keep ratio λ = 0.2. The maximum span length L for mention proposal is 10 and the maximum number of antecedents kept for each mention is C = 50. The SpanBERT parameters are updated by the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 1 × 10−5, and the task parameters are updated by the Ranger optimizer5 with an initial learning rate of 2 × 10−4.

4.2 Baselines

We compare the CorefQA model with previous neural models that are trained end-to-end:

• e2e-coref (Lee et al., 2017) is the first end-to-end coreference system that learns which spans are entity mentions and how to best cluster them jointly. Their token representations are built upon the GloVe (Pennington et al., 2014) and Turian (Turian et al., 2010) embeddings.

• c2f-coref + ELMo (Lee et al., 2018) extends Lee et al. (2017) by combining coarse-to-fine pruning with a higher-order inference mechanism. Their representations are built upon ELMo embeddings (Peters et al., 2018).

• c2f-coref + BERT-large (Joshi et al., 2019b) builds the c2f-coref system on top of BERT (Devlin et al., 2019) token representations.

• EE + BERT-large (Kantor and Globerson, 2019) represents each mention in a cluster via an approximation of the sum of all mentions in the cluster.

• c2f-coref + SpanBERT-large (Joshi et al., 2019a) focuses on pre-training span representations to better represent and predict spans of text.

5 https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer


Model                                                 MUC                 B3                  CEAFφ4              Avg. F1
                                                      P     R     F1      P     R     F1      P     R     F1
e2e-coref (Lee et al., 2017)                          78.4  73.4  75.8    68.6  61.8  65.0    62.7  59.0  60.8    67.2
c2f-coref + ELMo (Lee et al., 2018)                   81.4  79.5  80.4    72.2  69.5  70.8    68.2  67.1  67.6    73.0
EE + BERT-large (Kantor and Globerson, 2019)          82.6  84.1  83.4    73.3  76.2  74.7    72.4  71.1  71.8    76.6
c2f-coref + BERT-large (Joshi et al., 2019b)          84.7  82.4  83.5    76.5  74.0  75.3    74.1  69.8  71.9    76.9
c2f-coref + SpanBERT-large (Joshi et al., 2019a)      85.8  84.8  85.3    78.3  77.9  78.1    76.4  74.2  75.3    79.6
CorefQA + SpanBERT-base                               85.2  87.4  86.3    78.7  76.5  77.6    76.0  75.6  75.8    79.9 (+0.3)
CorefQA + SpanBERT-large                              88.6  87.4  88.0    82.4  82.0  82.2    79.9  78.3  79.1    83.1 (+3.5)

Table 1: Evaluation results on the English CoNLL-2012 shared task. The average F1 of MUC, B3, and CEAFφ4 is the main evaluation metric. Ensemble models are not included in the table for a fair comparison.

Model                          M     F     B     O
e2e-coref                      67.2  62.2  0.92  64.7
c2f-coref + ELMo               75.8  71.1  0.94  73.5
c2f-coref + BERT-large         86.9  83.0  0.95  85.0
c2f-coref + SpanBERT-large     88.8  84.9  0.96  86.8
CorefQA + SpanBERT-large       88.9  86.1  0.97  87.5

Table 2: Results on GAP. CorefQA achieves the state-of-the-art performance on all metrics, including F1 scores on Masculine (M) and Feminine (F) examples, the Bias factor (B = F / M) and the Overall (O) F1 score.

4.3 Results on CoNLL-2012 Shared Task

The English data of the CoNLL-2012 shared task (Pradhan et al., 2012) contains 2,802/343/348 train/development/test documents in 7 different genres. The main evaluation is the average of three metrics – MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998), and CEAFφ4 (Luo, 2005) – on the test set according to the official CoNLL-2012 evaluation scripts.6

We compare the CorefQA model with several baseline models in Table 1. Our CorefQA system achieves a huge performance boost over existing systems: with SpanBERT-base, it achieves an F1 score of 79.9, which already outperforms the previous SOTA model using SpanBERT-large by 0.3. With SpanBERT-large, it achieves an F1 score of 83.1, a 3.5 performance boost over the previous SOTA system.

4.4 Results on GAP

The GAP dataset (Webster et al., 2018) is a gender-balanced dataset that targets the challenges of resolving naturally occurring ambiguous pronouns. It comprises 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name) sampled from Wikipedia.

6 http://conll.cemantix.org/2012/software.html

                                     Avg. F1   ∆
CorefQA                              83.4
– SpanBERT                           79.6      -3.8
– Mention Proposal Pre-training      75.9      -7.5
– Question Answering                 75.0      -8.4
– Quoref Pre-training                82.7      -0.7
– SQuAD Pre-training                 83.1      -0.3

Table 3: Ablation studies on the CoNLL-2012 development set. SpanBERT token representations, the mention-proposal pre-training, and the question answering pre-training all contribute significantly to the good performance of the full model.

We follow the protocols in Webster et al. (2018); Joshi et al. (2019b) and use the off-the-shelf resolver trained on the CoNLL-2012 dataset to obtain performance on the GAP dataset. Table 2 presents the results. We can see that the proposed CorefQA model achieves state-of-the-art performance on all metrics on the GAP dataset.

5 Ablation Study and Analysis

We perform comprehensive ablation studies and analyses on the CoNLL-2012 development dataset. Results are shown in Table 3.

5.1 Effects of Different Modules in the Proposed Framework

Effect of SpanBERT Replacing SpanBERT with vanilla BERT leads to a 3.5 F1 degradation. This verifies the importance of span-level pre-training for coreference resolution and is consistent with previous findings (Joshi et al., 2019a).

Effect of Pre-training the Mention Proposal Network Skipping the pre-training of the mention proposal network using golden mentions results in a 7.2 F1 degradation, which is in line with our expectation.


Figure 3: Performance on the development set of the CoNLL-2012 dataset with various numbers of speakers. F1 (Speaker as feature): F1 score for the strategy that treats speaker information as a mention-pair feature. F1 (Speaker as input): F1 score for our strategy that treats speaker names as token input. Frequency: percentage of documents with a specific number of speakers.

A randomly initialized mention proposal model implies that mentions are randomly selected. Randomly selected mentions will mostly be transformed to unanswerable queries. This makes it hard for the question answering model to learn at the initial training stage, leading to inferior performance.

Effect of QA Pre-training on the Augmented Datasets One of the most valuable strengths of converting anaphora resolution to question answering is that existing QA datasets can be readily used for data augmentation purposes. We see a contribution of 0.7 F1 from pre-training on the Quoref dataset (Dasigi et al., 2019a) and a contribution of 0.3 F1 from pre-training on the SQuAD dataset (Rajpurkar et al., 2016a).

Effect of Question Answering We aim to study the pure performance gain of the paradigm shift from mention-pair scoring to query-based span prediction. For this purpose, we replace the mention linking module with the mention-pair scoring module described in Lee et al. (2018), while everything else remains unchanged. We observe an 8.1 F1 degradation in performance, demonstrating the significant superiority of the proposed question answering framework over the mention-pair scoring framework.

[Figure 4 plots mention recall (%) against the number of spans λ kept per word, comparing Joshi et al. (2019a) and our model at various values of λ and at the actual λ used.]

Figure 4: Change of mention recall as we increase the number of spans λ kept per word.

5.2 Analyses on Speaker Modeling Strategies

We compare our speaker modeling strategy (denoted by Speaker as input), which directly concatenates the speaker's name with the corresponding utterance, with the strategy in Wiseman et al. (2016); Lee et al. (2017); Joshi et al. (2019a) (denoted by Speaker as feature), which converts speaker information into binary features indicating whether two mentions are from the same speaker. Figure 3 shows the average F1 scores broken down by the number of speakers in each document.

Results show that the proposed strategy performs significantly better on documents with a larger number of speakers. Compared with the coarse modeling of whether two utterances are from the same speaker, a speaker's name can be thought of as a speaker ID, as in persona-based dialogue learning (Li et al., 2016; Zhang et al., 2018b; Mazare et al., 2018). Representations learned for names have the potential to better generalize the global information of the speakers in the multi-party dialogue situation, leading to better context modeling and thus better results.

5.3 Analysis on the Overall Mention Recall

Since the proposed framework has the potential to retrieve mentions missed at the mention proposal stage, we expect it to have a higher overall mention recall than previous models (Lee et al., 2017, 2018; Zhang et al., 2018a; Kantor and Globerson, 2019).

We examine the proportion of gold mentions covered in the development set as we increase the hyperparameter λ (the number of spans kept per word) in Figure 4.


1. [Freddie Mac] is giving golden parachutes to two of its ousted executives. ... Yesterday Federal Prosecutions announced a criminal probe into [the company].

2. [A traveling reporter] now on leave and joins us to tell [her] story. Thank [you] for coming in to share this with us.

3. Paula Zahn: [Thelma Gutierrez] went inside the forensic laboratory where scientists are trying to solve this mystery.
Thelma Gutierrez: In this laboratory alone [I]'m surrounded by the remains of at least twenty different service members who are in the process of being identified so that they too can go home.

Table 4: Example mention clusters that were correctly predicted by our model, but wrongly predicted by c2f-coref + SpanBERT-large. Bold spans in brackets represent coreferent mentions. Italic spans represent the speaker's name of the utterance.

Our model consistently outperforms the baseline model across various values of λ. Notably, our model is less sensitive to smaller values of λ. This is because missed mentions can still be retrieved at the mention linking stage.

5.4 Qualitative Analysis

We provide qualitative analyses to highlight the strengths of our model in Table 4.

As shown in Example 1, by explicitly formulating the anaphora identification of the company as a query, our model uses more information from the local context, and successfully identifies Freddie Mac as the answer from a longer distance.

The model can also efficiently harness the speaker information in a conversational setting. In Example 3, it would be difficult to identify that [Thelma Gutierrez] is the correct antecedent of the mention [I] without knowing that Thelma Gutierrez is the speaker of the second utterance. However, our model successfully identifies it by directly feeding the speaker's name at the input level.

6 Conclusion

In this paper, we present CorefQA, a coreference resolution model that casts anaphora identification as the task of query-based span prediction in question answering. We showed that the proposed formalization can successfully retrieve mentions left out at the mention proposal stage. It also makes data augmentation using a plethora of existing question answering datasets possible. Furthermore, a new speaker modeling strategy can also boost the performance in dialogue settings. Empirical results on two widely-used coreference datasets demonstrate the effectiveness of our model. In future work, we will explore novel approaches to generate the questions based on each mention, and evaluate the influence of different question generation methods on the coreference resolution task.

Acknowledgement

We thank all anonymous reviewers for their comments and suggestions. The work is supported by the National Natural Science Foundation of China (NSFC No. 61625107 and 61751209).

References

Rahul Aralikatte, Matthew Lamm, Daniel Hardt, and Anders Søgaard. 2019. Ellipsis and coreference resolution as question answering. CoRR, abs/1908.11141.

Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference, pages 563–566.

Anders Bjorkelund and Jonas Kuhn. 2014. Learning structured perceptrons for coreference resolution with latent antecedents and non-local features. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 1: Long Papers, pages 47–57.

Rakesh Chada. 2019. Gendered pronoun resolution using BERT and an extractive question answering formulation. arXiv preprint arXiv:1906.03695.

Kevin Clark and Christopher D. Manning. 2015. Entity-centric coreference resolution with model stacking. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1405–1415.

Kevin Clark and Christopher D. Manning. 2016. Improving coreference resolution by learning entity-level distributed representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.

Pradeep Dasigi, Nelson F. Liu, Ana Marasovic, Noah A. Smith, and Matt Gardner. 2019a. Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. CoRR, abs/1908.05803.

Pradeep Dasigi, Nelson F. Liu, Ana Marasovic, Noah A. Smith, and Matt Gardner. 2019b. Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. arXiv preprint arXiv:1908.05803.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186.

Greg Durrett and Dan Klein. 2013. Easy victories and uphill battles in coreference resolution. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1971–1982.

Ali Emami, Paul Trichelair, Adam Trischler, Kaheer Suleman, Hannes Schulz, and Jackie Chi Kit Cheung. 2019. The KnowRef coreference corpus: Removing gender and number cues for difficult pronominal anaphora resolution. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers, pages 3952–3961.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, et al. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567.

Luheng He, Mike Lewis, and Luke Zettlemoyer. 2015. Question-answer driven semantic role labeling: Using natural language to annotate natural language. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 643–653.

Yutai Hou, Yijia Liu, Wanxiang Che, and Ting Liu. 2018. Sequence-to-sequence data augmentation for dialogue language understanding. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 1234–1245.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2019a. SpanBERT: Improving pre-training by representing and predicting spans. CoRR, abs/1907.10529.

Mandar Joshi, Omer Levy, Daniel S. Weld, and Luke Zettlemoyer. 2019b. BERT for coreference resolution: Baselines and analysis. CoRR, abs/1908.09091.

Ben Kantor and Amir Globerson. 2019. Coreference resolution with entity equalization. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers, pages 673–677.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Sosuke Kobayashi. 2018. Contextual augmentation: Data augmentation by words with paradigmatic relations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 452–457.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 188–197.

Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 687–692.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada, August 3-4, 2017, pages 333–342.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.

Jiwei Li, Michel Galley, Chris Brockett, Georgios P. Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155.


Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2019a. A unified MRC framework for named entity recognition. arXiv preprint arXiv:1910.11476.

Xiaoya Li, Fan Yin, Zijun Sun, Xiayu Li, Arianna Yuan, Duo Chai, Mingxin Zhou, and Jiwei Li. 2019b. Entity-relation extraction as multi-turn question answering. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers, pages 1340–1350.

Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In HLT/EMNLP 2005, Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 6-8 October 2005, Vancouver, British Columbia, Canada, pages 25–32.

Pierre-Emmanuel Mazare, Samuel Humeau, Martin Raison, and Antoine Bordes. 2018. Training millions of personalized dialogue agents. arXiv preprint arXiv:1809.01984.

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. CoRR, abs/1806.08730.

Leora Morgenstern, Ernest Davis, and Charles L. Ortiz Jr. 2016. Planning, executing, and evaluating the Winograd schema challenge. AI Magazine, 37(1):50–54.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning - Proceedings of the Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes, EMNLP-CoNLL 2012, July 13, 2012, Jeju Island, Korea, pages 1–40.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pages 784–789.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016a. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2383–2392.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016b. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Alon Talmor and Jonathan Berant. 2019. MultiQA: An empirical investigation of generalization and transfer in reading comprehension. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers, pages 4911–4921.

Joseph P. Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In ACL 2010, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, July 11-16, 2010, Uppsala, Sweden, pages 384–394.

Marc B. Vilain, John D. Burger, John S. Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In Proceedings of the 6th Conference on Message Understanding, MUC 1995, Columbia, Maryland, USA, November 6-8, 1995, pages 45–52.

Kellie Webster, Marta Recasens, Vera Axelrod, and Jason Baldridge. 2018. Mind the GAP: A balanced corpus of gendered ambiguous pronouns. TACL, 6:605–617.

Sam Wiseman, Alexander M. Rush, and Stuart M. Shieber. 2016. Learning global features for coreference resolution. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 994–1004.

Sam Wiseman, Alexander M. Rush, Stuart M. Shieber, and Jason Weston. 2015. Learning anaphoricity and antecedent ranking features for coreference resolution. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1416–1426.


Rui Zhang, Cícero Nogueira dos Santos, Michihiro Yasunaga, Bing Xiang, and Dragomir R. Radev. 2018a. Neural coreference resolution with deep biaffine attention by joint mention detection and mention clustering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pages 102–107.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018b. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Ryan Cotterell, Vicente Ordonez, and Kai-Wei Chang. 2019. Gender bias in contextualized word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 629–634.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 15–20.