Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 4717–4731, August 1–6, 2021. ©2021 Association for Computational Linguistics

Benchmarking Scalable Methods for Streaming Cross Document Entity Coreference

Robert L. Logan IV* (University of California, Irvine), Andrew McCallum (Google Research; University of Massachusetts, Amherst), Sameer Singh (University of California, Irvine), Daniel Bikel (Google Research)
{rlogan, sameer}@uci.edu, {mccallum, dbikel}@google.com

Abstract

Streaming cross document entity coreference (CDC) systems disambiguate mentions of named entities in a scalable manner via incremental clustering. Unlike other approaches for named entity disambiguation (e.g., entity linking), streaming CDC allows for the disambiguation of entities that are unknown at inference time. Thus, it is well-suited for processing streams of data where new entities are frequently introduced. Despite these benefits, this task is currently difficult to study, as existing approaches are either evaluated on datasets that are no longer available, or omit other crucial details needed to ensure fair comparison. In this work, we address this issue by compiling a large benchmark adapted from existing free datasets, and performing a comprehensive evaluation of a number of novel and existing baseline models.1 We investigate: how to best encode mentions, which clustering algorithms are most effective for grouping mentions, how models transfer to different domains, and how bounding the number of mentions tracked during inference impacts performance. Our results show that the relative performance of neural and feature-based mention encoders varies across different domains, and in most cases the best performance is achieved using a combination of both approaches. We also find that performance is minimally impacted by limiting the number of tracked mentions.
1 Introduction

The ability to disambiguate mentions of named entities in text is a central task in the field of information extraction, and is crucial to topic tracking, knowledge base induction, and question answering. Recent work on this problem has focused almost solely on entity linking-based approaches, i.e., models that link mentions to a fixed set of known entities. While significant strides have been made on this front, with systems that can be trained end-to-end (Kolitsas et al., 2018), on millions of entities (Ling et al., 2020), and that link to entities using only their textual descriptions (Logeswaran et al., 2019), all entity linking systems suffer from the significant limitation that they are restricted to linking to a curated list of entities that is fixed at inference time. Thus they are of limited use when processing data streams where new entities regularly appear, such as research publications, social media feeds, and news articles.

* Work done during an internship at Google Research.
1 Code and data available at: https://github.com/rloganiv/streaming-cdc

In contrast, the alternative approach of cross-document entity coreference (CDC) (Bagga and Baldwin, 1998; Gooi and Allan, 2004; Singh et al., 2011; Dutta and Weikum, 2015), which disambiguates mentions via clustering, does not suffer from this shortcoming. Instead, most CDC algorithms suffer from a different failure mode: lack of scalability. Since they run expensive clustering routines over the entire set of mentions, they are not well suited to applications where mentions arrive one at a time. There are, however, a subset of streaming CDC methods that avoid this issue by clustering mentions incrementally (Figure 1). Unfortunately, despite such methods' apparent fitness for streaming data scenarios, this area of research has received little attention from the NLP community.
To our knowledge there are only two existing works on the task (Rao et al., 2010; Shrimpton et al., 2015), and only the latter evaluates truly streaming systems, i.e., systems that process new mentions in constant time with constant memory. One crucial factor limiting research on this topic is a lack of free, publicly accessible benchmark datasets; datasets used in existing works are either small and impossible to reproduce (e.g., the dataset collected by Shrimpton et al. (2015) only contains a few hundred unique entities, and many of the

[Figure 1: Streaming Cross-Document Coreference. (a) New mentions arrive over time, e.g., "SARS-CoV fusion peptides induce membrane surface ordering and curvature...", "...gather information on the membrane fusion mechanism promoted by two putative SARS FPs...", "Mild encephalitis/encephalopathy with reversible splenial lesion (MERS) associated with...", "...Middle East respiratory syndrome (MERS) outbreaks have been linked to healthcare facilities...". (b) Mentions are encoded as points in a vector space and incrementally clustered. As the space grows, some points are removed to ensure that the amount of memory used does not exceed a given threshold.]

annotated tweets are no longer available for download) or lack the necessary canonical ordering and are expensive to procure (e.g., the ACE 2008 and TAC-KBP 2009 corpora used by Rao et al. (2010)). To remedy this, we compile a benchmark of three datasets for evaluating English streaming CDC systems, along with a canonical ordering in which evaluation data should be processed. These datasets are derived from existing datasets that cover diverse subject matter: biomedical texts (Mohan and Li, 2019), news articles (Hoffart et al., 2011), and Wikia fandoms (Logeswaran et al., 2019).

We evaluate a number of novel and existing streaming CDC systems on this benchmark. Our systems utilize a two-step approach where: 1) each mention is encoded using a neural or feature-based model, and 2) the mention is then clustered with existing mentions using an incremental clustering algorithm. We investigate the performance of different mention encoders (existing feature-based methods, pretrained LMs, and encoders from entity linkers such as RELIC (Ling et al., 2020) and BLINK (Wu et al., 2020)), and incremental clustering algorithms (greedy nearest-neighbors clustering, and a recently introduced online agglomerative clustering algorithm, GRINCH (Monath et al., 2019)). Since GRINCH does not use bounded memory, which is required for scalability in the streaming setting, we introduce a novel bounded memory variant that prunes nodes from the cluster tree when the number of leaves exceeds a given size, and compare its performance to existing bounded memory approaches.

Our results show that the relative performance of different mention encoders and clustering algorithms varies across different domains. We find that existing approaches for streaming CDC (e.g., feature-based mention encoding with greedy nearest-neighbors clustering) outperform neural approaches on two of three datasets (+1-3% abs. improvement in CoNLL F1), while a RELIC-based encoder with GRINCH performs better on the last dataset (+9% abs. improvement in CoNLL F1). In cases where existing approaches perform well, we also find that better performance can be obtained by using a combination of neural and feature-based mention encoders. Lastly, we observe that by using relatively simple memory management policies, e.g., removing old and redundant mentions from the mention cache, bounded memory models can achieve performance near on-par with unbounded models while storing only a fraction of the mentions (in one case we observe a 2% abs. drop in CoNLL F1 caching only 10% of the mentions).

2 Streaming Cross-Document Entity Coreference (CDC)

2.1 Task Overview

The key goal of cross-document entity coreference (CDC) is to identify mentions that refer to the same entity. Formally, let M = {m_1, . . . , m_|M|} denote a corpus of mentions, where each mention consists of a surface text m.surface (e.g., the colored text in Figure 1a), as well as its surrounding context m.context (e.g., the text in black). Provided M as an input, a CDC system produces a disjoint clustering over the mentions C = {C_1, . . . , C_|C|}, |C| ≤ |M|, as the output, where each cluster C_e = {m ∈ M | m.entity = e} is the set of mentions that refer to the same entity.

In streaming CDC, there are two additional requirements: 1) mentions arrive in a fixed order (M is a list) and are clustered incrementally, and 2) memory is constrained so that only a fixed number of mentions can be stored. This can be formulated in terms of the above notation by adding a time index t: the set of all mentions observed at or before time T is {m_t ∈ M | t ≤ T}; M_T denotes a subset of "active" mentions from this set, whose size does not exceed a fixed memory bound k, i.e., |M_T| ≤ k; and C_T is comprised of clusters that only contain mentions in M_T. Due to the streaming nature, M_T − {m_T} ⊆ M_{T−1}, i.e., a mention cannot be added back to M_T if it was previously removed. When the memory bound is reached, mentions are removed from M_T according to a memory management policy Φ.

An illustrative example is provided in Figure 1. Mentions arrive in left-to-right order (Figure 1a), with the clustering process depicted in Figure 1b (memory bound k = 3). At time T = 4, the mention m_1 is removed from M_4. Note that, even though m_1 is removed, it is still possible to disambiguate mentions of all previously observed entities, whereas this would not be possible had m_3 or m_4 been removed. This illustrates the effect the memory management policy can have on performance.
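To make the bookkeeping above concrete, the following is a minimal sketch of a streaming CDC loop. The function, the exact-match similarity used in the test, and the oldest-first eviction are illustrative assumptions, not the paper's implementation; they merely exercise the constraints of Section 2.1 (incremental clustering, a bound k on active mentions, no re-adding of evicted mentions).

```python
def stream_cdc(mentions, sim, tau, k):
    """Incrementally cluster a stream of mentions, keeping at most k active.

    Returns the final cluster id assigned to each mention. Evicted mentions
    keep their last assignment, mirroring the evaluation bookkeeping in
    Section 4.2. `sim` is any similarity function; `tau` is the threshold.
    """
    active = []          # M_T: list of (mention, cluster_id), oldest first
    assignments = []     # cluster id for every mention ever seen
    next_cluster = 0
    for m in mentions:
        # All active mentions whose similarity to m exceeds the threshold.
        matches = {c for (m2, c) in active if sim(m, m2) > tau}
        if matches:
            cid = min(matches)
            # Merge all matching clusters into one (cf. Section 3.2).
            active = [(m2, cid if c in matches else c) for (m2, c) in active]
            assignments = [cid if c in matches else c for c in assignments]
        else:
            cid = next_cluster
            next_cluster += 1
        active.append((m, cid))
        assignments.append(cid)
        if len(active) > k:
            active.pop(0)  # "window" policy: evict the oldest active mention
    return assignments
```

Note that once a mention is popped from `active` it can never influence future clustering decisions, which is exactly the constraint that makes the choice of eviction policy matter.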

2.2 Background and Motivation

Cross Document Entity Coreference As we show later, we employ a two-stage CDC pipeline where mentions are first encoded as vectors, and subsequently clustered. This approach is used in most existing work on CDC (Bagga and Baldwin, 1998; Mann and Yarowsky, 2003; Gooi and Allan, 2004). In the past decade, research on CDC has mainly focused on improving scalability (Singh et al., 2011), and jointly learning to perform CDC with other tasks such as entity linking (Dutta and Weikum, 2015) and event coreference (discussed in the next paragraph). This work similarly investigates whether entity linking is beneficial for CDC; however, we use entity linkers that are pretrained separately and kept fixed during inference.

Recently, there has been renewed interest in performing CDC jointly with cross-document event coreference (Barhom et al., 2019; Meged et al., 2020; Cattan et al., 2020; Caciularu et al., 2021) on the ECB+ dataset (Cybulska and Vossen, 2014). Although we do not evaluate methods from this line of research in this work, we hope that the benchmark we compile will be useful for future evaluation of these systems.

Streaming Cross Document Coreference The methods mentioned in the previous paragraphs disambiguate mentions all at once, and are thus unsuitable for applications where a large number of mentions appear over time. Rao et al. (2010) propose to address this issue using an incremental clustering approach where each new mention is either placed into one of a number of candidate clusters, or into a new cluster if similarity does not exceed a given threshold (Allaway et al. (2021) use a similar approach for joint entity and event coreference). Shrimpton et al. (2015) note that this incremental clustering does not process mentions in constant time/memory, and thus is not "truly streaming". They present the only truly streaming approach for CDC by introducing a number of memory management policies that limit the size of M, which we describe in more detail in Section 3.3.

One of the key problems inhibiting further research on streaming CDC is a lack of suitable evaluation datasets for measuring system performance. The datasets used in Rao et al. (2010) are either small in size (a few hundred mentions), contain few annotated entities, or are expensive to procure. Additionally, they do not include any canonical ordering of the mentions, which precludes consistent evaluation of streaming systems. Meanwhile, the Tweets annotated by Shrimpton et al. (2015) only cover two surface texts (Roger and Jessica) and are no longer accessible via the Twitter API.2 To address this, we collect a new evaluation benchmark comprised of 3 existing publicly available datasets, covering a diverse collection of topics (News, Biomedical Articles, Wikias) with natural orderings (e.g., chronological, categorical). This benchmark is described in detail in Section 4.1.

Entity Linking CDC is similar to the task of entity linking (EL, Mihalcea and Csomai (2007)), which also addresses the problem of named entity disambiguation, with the key distinction that EL is formulated as a supervised classification problem (the list of entities is known at training and test time), while CDC is an unsupervised clustering problem. In particular, CDC is similar to time-aware EL (Agarwal et al., 2018), where temporal context is used to help disambiguate mentions, and zero-shot EL (Zeshel, Logeswaran et al. (2019)), where the set of entities linked to during evaluation does not overlap with the set of entities observed during training. Streaming CDC can also be considered a method for time/order-aware zero-shot named entity disambiguation; however, it is strictly more challenging, as it does not assume access to a curated list of entities at prediction time, or any supervised training data.

2 At this time, only 56 of the first 100 tweets were available.

Although CDC is formulated as a strictly unsupervised clustering task, this does not preclude the usage of labeled data for transfer learning. One of the primary goals of this work is to investigate whether the mention encoders learned by entity linking systems provide useful representations in the first step of the CDC pipeline. Specifically, we consider mention encoders for two state-of-the-art entity linking architectures: RELIC (Ling et al., 2020) and the BLINK bi-encoder (Wu et al., 2020).

Emerging Entity Detection Streaming CDC is also related to the task of emerging entity detection (EED, Farber et al. (2016)), which, given a mention that cannot be linked, seeks to predict whether it should produce a new KB entry. Although both tasks share similar motivations, they adopt different approaches (EED is formulated as a binary classification task), and CDC does not require deciding which entities should and should not be added to a knowledge base. However, in many practical applications, it may make sense to apply streaming CDC only to emerging entities.

3 Building Streaming CDC Systems

Following previous work, we adopt a two-step approach to performing streaming cross-document coreference. In the first step, an encoder is used to produce a vector representation of the incoming mention, m_t = Enc(m_t). In the second step, these vectors are input into an incremental clustering algorithm to update the predicted clustering, C_t = Clust(C_{t−1}, m_t). In the following sections we describe in detail the mention encoders and clustering algorithms used in this work.

3.1 Mention Encoders

The primary goal of a mention encoder Enc(m_t) is to produce a compact representation of the mention, including both the surface and the context text.

Feature-Based Encoders Existing models for streaming cross-document coreference exclusively make use of feature-based mention encoders. While there are many feature engineering options explored in the literature, in this work we consider the mention encoding approach proposed by Shrimpton et al. (2015), which uses character skip bigram indicator vectors to encode the surface text, and tf-idf vectors to represent contexts. When using this encoding scheme, similarity scores are computed independently for the surface and context embeddings, and a weighted average is taken to produce the final similarity score. We use the same setup and parameters as Shrimpton et al. (2015).
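As an illustration, a surface similarity over character skip bigrams combined with a context score might look like the sketch below. The skip-one definition of skip bigrams, the cosine similarity, and the equal weighting are assumptions made for illustration; Shrimpton et al. (2015) specify the exact features and weights.

```python
import math
from collections import Counter

def char_skip_bigrams(surface):
    """Character skip bigrams: pairs (s[i], s[i+2]). This skip-one definition
    is one common convention, assumed here for illustration."""
    s = surface.lower()
    return Counter(s[i] + s[i + 2] for i in range(len(s) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    num = sum(a[key] * b[key] for key in a if key in b)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def surface_sim(s1, s2):
    return cosine(char_skip_bigrams(s1), char_skip_bigrams(s2))

def combined_sim(m1, m2, context_sim, w=0.5):
    # Weighted average of surface and context similarity, as in the
    # feature-based encoder description; the weight w is illustrative.
    return w * surface_sim(m1["surface"], m2["surface"]) + (1 - w) * context_sim
```

A tf-idf context vector would be scored with the same `cosine` helper, keeping the surface and context channels independent until the final weighted average.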

Masked Language Model Encoders We also consider mention encodings produced by masked language models, particularly BERT (Devlin et al., 2019). We encode the mention by feeding the contiguous text of the mention (containing both the surrounding and surface text) into BERT and concatenating the contextualized vectors associated with the first and last word-piece of the surface text. That is, let s, e ∈ N denote the start and end of the mention surface text within the complete mention, and let M = BERT(m) denote the contextualized word vectors output by BERT. Then the mention encoding is given by: Enc_MLM(m) = [M[s]; M[e]].
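The encoding Enc_MLM(m) = [M[s]; M[e]] is a simple gather-and-concatenate over BERT's token outputs. The sketch below uses a stand-in array of token vectors rather than an actual BERT forward pass; with the HuggingFace transformers library, `token_vectors` would come from something like `bert(input_ids).last_hidden_state[0]`.

```python
import numpy as np

def encode_mention_mlm(token_vectors, s, e):
    """Concatenate the contextualized vectors of the first (s) and last (e)
    word-piece of the surface span: Enc_MLM(m) = [M[s]; M[e]].

    token_vectors: array of shape [num_tokens, hidden], standing in for
    BERT's contextualized output for the mention text.
    """
    return np.concatenate([token_vectors[s], token_vectors[e]])
```

The resulting vector has dimension 2 x hidden, since the two word-piece vectors are stacked rather than pooled.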

Entity Linker-Based Encoders We consider producing mention encodings using the bi-encoder-based neural entity linkers RELIC (Ling et al., 2020) and BLINK (Wu et al., 2020). The bi-encoder architecture is comprised of two components, a mention encoder Enc_m and an entity encoder Enc_e, and is trained to maximize a similarity score (e.g., dot-product) between the mention encoding and the encoding of its underlying entity, while simultaneously minimizing the score for other entities. We use Enc_m from pretrained entity linkers to encode mentions for CDC.

Hybrid Encoder We also consider a hybrid encoder which combines feature-based and neural mention encoders. We retain the feature-based character skip bigram surface text encoder, but use one of the neural entity linker encoders in place of the tf-idf context representation. Similarity scores are computed by averaging the two without any weights, unlike Shrimpton et al. (2015).

3.2 Clustering Algorithms

Here we describe incremental clustering approaches, Clust(C_{t−1}, m_t), that compute a new clustering when m_t is added to the mentions under consideration (M).

[Figure 2: Bounded GRINCH. (a) Greedy agglomerative clustering produces a sub-optimal tree structure due to the order in which points are received. (b) The GRINCH (Monath et al., 2019) algorithm recovers from this mistake by reconfiguring the tree structure (in this case using a rotation operation). (c) To ensure the memory used by GRINCH remains bounded, we add a new operation, remove, that prunes two leaf nodes when the number of leaves exceeds a given size. Nodes are selected for removal using a scoring function φ_r. In this case nodes m1 and m2 are selected, and their parent m_{1,2} becomes a new leaf.]

Greedy Nearest Neighbors Clustering Shrimpton et al. (2015) and Rao et al. (2010) both evaluate CDC using a single-linkage incremental clustering approach that clusters each new mention m with its nearest neighbor m′ = arg max_{m′∈M} sim(m, m′), if the similarity exceeds some threshold τ. We use a similar approach here; however, we cluster m with all m′ ∈ M such that sim(m, m′) > τ, thus allowing previously separate clusters to be merged if m is similar to both of them.

GRINCH Gooi and Allan (2004) find that average-link hierarchical agglomerative clustering can outperform greedy single-link approaches. However, agglomerative approaches are typically not used for streaming CDC because running the algorithm at each time step is too expensive, and incremental variants of the approach are not able to recover from incorrect choices made early on (Figure 2a). The recently introduced GRINCH clustering algorithm (Monath et al., 2019) uses rotate and graft operations that reconfigure the tree, thereby avoiding these issues (Figure 2b). We defer to the original paper for details, however we note that, for our application, each interior node of the cluster tree is computed as a weighted average of its children's representations (where the weights are proportional to the number of leaves). Thus at each interior node, it is possible to compute the similarity score between that node's children. This allows us to produce a flat clustering from the cluster tree by thresholding the similarity score, just as in the greedy clustering case.

3.3 Memory Management Policies

As described in Section 2.1, memory management policies decide which mentions to remove from M to prevent its size from exceeding the memory bound, providing scalable, memory-bounded variants of the clustering algorithms.

Bounded Memory Greedy NN Clustering For bounded memory greedy nearest neighbors clustering, we consider the following memory management policies of Shrimpton et al. (2015):
• Window: Remove the oldest mention in M.
• Cache: Remove the oldest mention in the least recently updated cluster C_LRU.
• Diversity: Remove the mention most similar to the mention just added, i.e., arg max_m sim(m, m_t).
• Diversity-Cache: A combination of the diversity and cache strategies, where the diversity strategy is used if the similarity score exceeds a given threshold, sim(m, m_t) > α, and the cache strategy is used otherwise.
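The eviction step of these policies might be implemented as below. The list-of-mentions layout and the callback standing in for the cache policy (which needs per-cluster update times) are our assumptions for illustration, not the original implementation.

```python
def evict_window(active):
    """Window: remove and return the oldest mention (index 0 = oldest)."""
    return active.pop(0)

def evict_diversity(active, sim, m_t):
    """Diversity: remove the mention most similar to the one just added."""
    idx = max(range(len(active)), key=lambda i: sim(active[i], m_t))
    return active.pop(idx)

def evict_diversity_cache(active, sim, m_t, alpha, evict_cache):
    """Diversity-Cache: use diversity if the best similarity exceeds alpha,
    otherwise fall back to the cache policy (passed in as a callback)."""
    idx = max(range(len(active)), key=lambda i: sim(active[i], m_t))
    if sim(active[idx], m_t) > alpha:
        return active.pop(idx)
    return evict_cache(active)
```

The diversity policy deliberately discards near-duplicates of the newest mention, keeping the active set spread out over distinct entities.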

Table 1: Dataset Statistics. |M|: #mentions, |E|: #unique entities, % Seen: fraction of entities observed during training, MAE: maximum active entities, i.e., the number of mentions an ideal streaming CDC system would need to store to perfectly cluster the data.

                     |M|    |E|   % Seen   MAE
AIDA        Train  18.5K   4.1K    100%   1.1K
            Dev     4.8K   1.6K     23%    290
            Test    4.5K   1.6K     16%    263
MedMentions Train   121K    18K    100%   4.7K
            Dev      42K   8.8K     27%   1.8K
            Test     39K   8.3K     26%   1.7K
Zeshel      Train    81K    32K    100%   9.3K
            Dev      18K   7.5K      0%   2.9K
            Test     17K   7.2K      0%   3.3K

Bounded Memory GRINCH Memory management for GRINCH is more complicated than for greedy clustering: instead of maintaining a flat clustering of mentions, GRINCH maintains a cluster hierarchy in the form of a binary cluster tree. Every time a mention is inserted into the tree, two new nodes are created: one node for the mention itself, and a new parent node linking the mention to its sibling (Figure 2a). Accordingly, when the memory bound is reached, the memory management policy for GRINCH must remove two nodes from the tree. Furthermore, in order to preserve the tree's binary structure, the removed nodes must be leaf nodes as well as siblings. Because the original GRINCH algorithm only includes routines for inserting nodes into the tree and reconfiguring the tree's structure, we modify GRINCH to include a new remove operation that prunes two nodes satisfying these criteria. The parent of these nodes then becomes a leaf node, whose vector representation is produced by combining the vector representations of its former children using a weighted average (this is conceptually similar to the collapse operation described in Kobren et al. (2017)). We consider the following policies here:
• Window: Remove the nodes whose parent was least recently added to the tree.
• Diversity: Remove the pair of nodes that are most similar to each other.
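The remove operation can be sketched as follows: two sibling leaves are pruned and their parent becomes a leaf whose vector is the leaf-count-weighted average of its children. The node layout is an assumption for illustration; the actual modification is applied to the official GRINCH implementation.

```python
class Node:
    """Minimal binary cluster-tree node (illustrative, not GRINCH's own)."""
    def __init__(self, vec, n_leaves=1, children=None):
        self.vec = list(vec)
        self.n_leaves = n_leaves          # number of mentions this node covers
        self.children = children or []

def remove_pair(parent):
    """Prune `parent`'s two sibling leaf children; `parent` becomes a leaf.

    The new leaf's vector is the weighted average of its former children,
    with weights proportional to the number of leaves each child covered
    (conceptually similar to `collapse` in Kobren et al. (2017)).
    """
    left, right = parent.children
    assert not left.children and not right.children, "must prune sibling leaves"
    total = left.n_leaves + right.n_leaves
    parent.vec = [
        (left.n_leaves * a + right.n_leaves * b) / total
        for a, b in zip(left.vec, right.vec)
    ]
    parent.children = []
    parent.n_leaves = total
    return parent
```

Each call reduces the leaf count by one, so repeated calls keep the tree within the memory bound while preserving its binary structure.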

4 Benchmarking Streaming CDC

In this section, we describe our proposed benchmark for evaluating streaming CDC systems.

4.1 Datasets

Current research on CDC is inhibited by a lack of large, publicly accessible datasets. We address this by compiling datasets for streaming CDC adapted from existing entity linking datasets: AIDA CoNLL-YAGO, MedMentions, and Zeshel.

AIDA AIDA CoNLL-YAGO (Hoffart et al., 2011) contains news articles from the Reuters Corpus written between August and December 1996, with annotations linking mentions to YAGO and Wikipedia. We create a canonical ordering for this dataset by ordering articles by date. As the original train, dev, and test splits respect this ordering, we use the original splits in our benchmark.

MedMentions The MedMentions (Mohan and Li, 2019) corpus contains abstracts for biomedical articles published to PubMed in 2016, annotated with links to the UMLS medical ontology. We order abstracts by publication date3 to create a canonical ordering. Since the original dataset is not ordered by date, we create new train, dev, and test splits of comparable size that respect this ordering.

Zeshel The Zeshel (Logeswaran et al., 2019) dataset consists of Wikia articles for different fandoms. In addition to the original set of annotated mentions, we use the provided entity descriptions as an additional source of mentions. We impose an ordering that groups all mentions belonging to the same Wikia together, and otherwise retains their original order in the Zeshel data. This is an interesting scenario for streaming CDC, as no clusters need be retained when transitioning to a new Wikia.

Analysis Statistics for the benchmark data are provided in Table 1, which lists the number of mentions and unique entities for each dataset. We also list the percentage overlap between entities in the training set and entities in the dev and test sets (% Seen), as well as the maximum active entities (MAE). MAE is a quantity introduced by Toshniwal et al. (2020), which measures the maximum number of "active entities" (i.e., entities that have been previously mentioned, and will be mentioned in the future) for a given dataset. It can alternatively be interpreted as the smallest possible memory bound that can be used in order to ensure that a CDC system can cluster each mention with at least one other mention of the same entity. Importantly, this number is a small fraction of the total number of mentions in each dataset, indicating that these datasets are appropriate for the streaming setting and for comparing memory management policies.
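Under our reading of the Toshniwal et al. (2020) definition, MAE can be computed from the stream of gold entity labels as below. This is a brute-force sketch; the exact boundary conventions (whether an entity mentioned at the current step counts as "active") are an assumption, not the reference implementation.

```python
def max_active_entities(entity_stream):
    """Maximum, over positions i, of the number of entities that were
    mentioned before i and are mentioned again at or after i."""
    first, last = {}, {}
    for i, e in enumerate(entity_stream):
        first.setdefault(e, i)  # first mention position of each entity
        last[e] = i             # last mention position of each entity
    mae = 0
    for i in range(len(entity_stream)):
        active = sum(1 for e in first if first[e] < i <= last[e])
        mae = max(mae, active)
    return mae
```

For the stream a, b, a, b, a both entities are active between their first and last mentions, giving MAE = 2, even though the stream contains five mentions; this gap between MAE and stream length is what makes bounded memory viable.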

4.2 Evaluation Metrics

We evaluate CDC performance using the standard evaluation metrics: MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998), CEAFe (Luo, 2005), and CoNLL F1, which is an average of the previous three. In order to perform evaluation when memory is bounded, we perform the following bookkeeping to track nodes which have been removed by the memory management policy. For bounded memory greedy NN clustering, we keep track of the removed node's predicted cluster (e.g., if the node was removed from cluster C, then it is considered an element of C during evaluation). This is similar to the evaluation used by Toshniwal et al. (2020). For bounded memory GRINCH, we keep track of the removed node's place within the tree structure, and produce a flat clustering using the thresholding approach described in Section 3.2 as if the node were never removed. Because leaf nodes (and accordingly removed nodes) are never updated by insertion or removal operations, nodes belonging to the same cluster before they are pruned will always remain in the same cluster during evaluation, which is the same assumption used for the greedy NN evaluation.

3 Six abstracts were omitted due to missing metadata.

4.3 Hyperparameters

Vocabulary and inverse document frequency (idf) weights are estimated using each dataset's train set. For masked language model encoders, we use an unmodified BERT-base architecture, with model weights provided by the HuggingFace transformers library (Wolf et al., 2020). For BLINK, we use the released BERT-large bi-encoder weights.4 Our bounded memory variant of GRINCH is based on the official implementation.5 Note that GRINCH does not currently support sparse inputs, so we do not include results for feature-based mention encoders. RELIC model weights are initialized from BERT-base, and then finetuned to perform entity linking in the following settings:

• RELIC (Wiki): Trained on the same Wikipedia data used to train the BLINK bi-encoder.

• RELIC (In-Domain): Trained on the respective benchmark's training dataset; a separate model is trained for each benchmark.

Training is performed using hyperparameters suggested by Ling et al. (2020).6 For each benchmark, the hybrid mention encoder uses the best performing RELIC variant on that benchmark. Cluster thresholds τ are chosen so that the number of predicted clusters on the dev dataset approximately matches the number of unique entities.
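The threshold-selection rule above can be sketched as a simple search. This is a minimal illustration under our own naming; `cluster_fn` stands in for any clustering routine mapping (mentions, τ) to cluster labels:

```python
def choose_threshold(cluster_fn, dev_mentions, n_entities, candidates):
    """Pick the clustering threshold tau whose number of predicted clusters
    on the dev set is closest to the number of unique entities."""
    def n_clusters(tau):
        return len(set(cluster_fn(dev_mentions, tau)))
    # Grid search over candidate thresholds.
    return min(candidates, key=lambda tau: abs(n_clusters(tau) - n_entities))
```

In practice the candidate grid would span the observed range of pairwise similarities on the dev set.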

5 Results

In this section, we provide a comprehensive evaluation of the design choices that define the existing and proposed approaches for streaming CDC.

4 https://github.com/facebookresearch/BLINK
5 https://github.com/iesl/grinch
6 Trained on a server with 754 GB RAM, an Intel Xeon Gold 5218 CPU, and 4x NVIDIA Quadro RTX 8000 GPUs.

Choice of Encoder We include the results for CDC systems with unbounded memory on the benchmark datasets in Table 2, as well as results for two baselines: 1) a system that clusters together all mentions with the same surface form (exact match), and 2) a system that only considers gold within-document clusters and does not merge clusters across documents (oracle within-doc). We observe that, in general, neural mention encoders are not sufficient to obtain good CDC performance. With the exception of RELIC (In-Domain) on MedMentions, no neural mention encoder is able to outperform the feature-based greedy NN approach, and furthermore, the MLM and BLINK mention encoders do not even surpass the exact match baseline. However, note that for AIDA and Zeshel, the best results are obtained using a hybrid mention encoder. Thus, in these domains, we can conclude that while neural mention encoders are useful for encoding contexts, CDC systems require an additional component to model surface text to achieve good performance. The results on MedMentions provide an interesting contrast to this conclusion. Here the RELIC (In-Domain) mention encoder outperforms both the feature-based and hybrid mention encoders. In the error analysis below, we find that this is due mainly to improved performance clustering mentions of entities seen when training the mention encoder.

Choice of Clustering Algorithm Comparing greedy nearest neighbors clustering to GRINCH, we do not observe a consistent trend across mention encoders or datasets. While the best performance on AIDA and Zeshel is achieved using greedy nearest neighbor clustering, the best performance on MedMentions is achieved using GRINCH. These results highlight the importance of benchmarking CDC systems on a number of different datasets; patterns observed on a single dataset do not extrapolate well to other settings. It is also interesting to observe that a much simpler approach often works better than the more complex one.
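For concreteness, greedy nearest-neighbor clustering can be rendered in a few lines. This is our minimal sketch of the scheme (each arriving mention joins the cluster of its most similar stored mention if that cosine similarity is at least τ, and otherwise starts a new cluster); the paper's implementation may differ in detail:

```python
import numpy as np

def greedy_nn_cluster(vectors, tau):
    """Incremental greedy nearest-neighbor clustering over mention vectors."""
    labels, stored = [], []  # stored: list of (unit vector, cluster label)
    for v in vectors:
        v = np.asarray(v, dtype=float)
        v = v / np.linalg.norm(v)
        if stored:
            sims = np.array([float(u @ v) for u, _ in stored])
            j = int(np.argmax(sims))
            if sims[j] >= tau:
                # Join the nearest neighbor's cluster.
                labels.append(stored[j][1])
                stored.append((v, stored[j][1]))
                continue
        # No sufficiently similar stored mention: start a new cluster.
        new_label = (max(labels) + 1) if labels else 0
        labels.append(new_label)
        stored.append((v, new_label))
    return labels
```

Because each mention is compared only against the stored set, the method is naturally compatible with the bounded-memory policies evaluated below.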

Error Analysis We characterize the errors of these models by investigating: a) the entities whose mentions are conflated (i.e., wrongly clustered together) and split (i.e., wrongly grouped into separate clusters), using the approach of Kummerfeld and Klein (2013), and b) differences in performance on entities that are seen vs. unseen during training, for models that use in-domain data. A subset of our results is provided in Table 3, with full results available in Tables 4–11 in the Appendix.

                       AIDA                    MedMentions                 Zeshel
                   MUC   B3  CEAF  Avg.    MUC   B3  CEAF  Avg.    MUC   B3  CEAF  Avg.

Exact Match        90.2 84.1  81.0  85.1   78.8 66.0  54.0  66.3   28.3 64.3  46.4  46.3
Oracle Within-Doc  15.2 46.8  47.1  36.4   16.5 32.8  34.8  28.0     -    -     -     -

Greedy NN
  Feature-Based    94.2 89.0  87.3  90.2   83.6 67.0  58.0  69.5   39.6 60.9  52.6  51.0
  MLM              75.9 71.1  58.1  68.4   70.8 52.1  42.0  55.0   16.2 53.8  44.8  38.3
  BLINK (Wiki)     58.2 56.3  56.6  57.0   59.4 39.2  43.2  47.3   36.6 36.9  40.9  41.6
  RELIC (Wiki)     92.4 89.4  83.6  88.5   73.2 56.1  42.4  57.2   36.2 58.1  48.7  47.7
  RELIC (In-Dom.)  93.2 80.7  84.5  86.1   86.8 69.5  62.4  72.9   28.2 61.4  42.5  44.0
  Hybrid           94.7 90.1  88.5  91.1   85.6 70.5  59.9  72.0   44.0 64.5  53.3  54.0

GRINCH
  MLM              37.8 59.2  41.5  46.2   70.8 52.1  42.0  55.0   49.0 38.0  33.1  40.0
  BLINK (Wiki)     64.3 26.9  23.2  38.1   83.2 17.1  11.9  37.4   45.6 24.8  21.7  30.7
  RELIC (Wiki)     91.6 88.3  82.5  87.5   73.9 57.9  42.2  58.0   72.6  4.2   4.3  27.0
  RELIC (In-Dom.)  82.8 84.0  69.5  78.8   85.4 73.3  61.8  73.5   27.3 57.5  40.1  41.6

Table 2: Unbounded Memory Results. CoNLL F1 scores for each valid combination of clustering algorithm and mention encoder. Similarity threshold clustering + hybrid mention encoder works best on AIDA and Zeshel, whereas GRINCH clustering + in-domain RELIC mention encoder works best for MedMentions.

Feature-Based + Greedy NN
  FIFA World Cup        . . . '' Japan , co-hosts of the World Cup in 2002 and ranked 20th in the world by . . .
  1995 Rugby World Cup  . . . team . Cuttitta announced his retirement after the 1995 World Cup , where he took issue with being dropped from . . .
  Rugby World Cup       . . . Australia to defeat with a last-ditch drop-goal in the World Cup quarter-final in Cape Town . . .

RELIC (Wiki) + Greedy NN
  FC Volendam           . . . leaders PSV Eindhoven romped to a 6-0 win over Volendam on Saturday . Their other marksmen were Brazilian . . .
  Feyenoord             . . . game . They boast a nine-point lead over Feyenoord , who have two games in hand , and . . .
  RKC Waalwijk          . . . division soccer match played on Friday : RKC Waalwijk 1 ( Starbuck 76 ) Willem II . . .

Table 3: Most Conflated Entities on AIDA. Left: Unique entity ID. Right: Mention with entity surface form in italics. Results for remaining models are provided in the Appendix.

In aggregate, these error metrics closely track the results in Table 2, where better models make fewer errors of all types. We do, however, observe that in-domain training improves RELIC's performance considerably on MedMentions (+15 CoNLL F1 on seen entities, and +18 on unseen entities), and is the primary reason underlying its improved performance over feature-based encoders (72.6 vs. 60.7 CoNLL F1 on seen entities, while performance on unseen entities is comparable).

Comparing mentions of the most conflated entities provides a qualitative sense of the failure modes of each method. We note that the feature-based method tends to fail at distinguishing entities with the same surface form, e.g., world cups of different sports, while neural entity linkers tend to conflate entities with similar contexts, particularly when surface forms are split into multiple word pieces in the model's vocabulary (each surface form in the bottom of Table 3 gets broken into 3+ word pieces).

Effect of Bounded Memory Results for the bounded memory setting are illustrated in Figure 3. In these experiments we take the best neural mention encoder for each benchmark dataset (RELIC (Wiki) for AIDA and Zeshel, and RELIC (In-Domain) for MedMentions), and plot the CoNLL F1 score for each of the memory management policies described in Section 3.3. We measure performance at memory bounds equal to the maximum number of active entities (MAE) and the total number of unique entities (|E|) for each dataset, as well as 1/2x and 2x multiples of these numbers. In sum, these results provide strong evidence that CDC systems can reliably cluster mentions in a truly streaming setting, even when memory is bounded to a small fraction of the number of entities encountered by the system. Most impressively, using the diversity-cache memory management policy, a greedy nearest neighbors bounded memory model achieves a CoNLL F1 score within 2% of the best performing unbounded memory model, while only storing approximately 10% (i.e., |E|/2) of the mentions.

We notice a few fairly consistent trends across datasets. The first is that increasing the memory bound has diminishing returns; while there is a large benefit from increasing the bound from MAE/2 to MAE, the difference in performance attained by increasing the bound from |E| to 2|E| is often negligible. We also find that naïve memory management policies that store recent mentions (i.e., window, W, and cache, C) tend to perform better than the policies that attempt to remove redundant mentions (i.e., diversity, D). This effect is particularly pronounced for small memory bounds. This is somewhat surprising: storing mentions of the same entity is particularly harmful when memory is limited, so encouraging diversity should be a good thing. One possible explanation is that the diversity policy is actually removing mentions of entities that appear within the same context, as we saw earlier that neural mention encoders appear to focus more on mention context than surface text. Lastly, regarding the comparison of greedy nearest neighbors clustering to GRINCH, we again see inconsistency in performance across datasets; GRINCH appears to perform better at larger cache sizes for AIDA and MedMentions, while greedy nearest neighbors clustering has much better performance than GRINCH on Zeshel.
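The window and diversity policies discussed above can be sketched as eviction rules over a mention buffer. This is an illustrative paraphrase of the policies' goals (keep the most recent mentions vs. evict the most redundant ones); the paper's exact eviction rules may differ:

```python
from collections import deque

def window_policy(memory, bound):
    """Window (W): evict the oldest stored mentions until the bound holds."""
    while len(memory) > bound:
        memory.popleft()

def diversity_policy(memory, bound, sim):
    """Diversity (D): repeatedly evict one member of the most similar
    (i.e. most redundant) pair of stored mentions."""
    mem = list(memory)
    while len(mem) > bound:
        pairs = [(i, j) for i in range(len(mem)) for j in range(i + 1, len(mem))]
        _, j = max(pairs, key=lambda ij: sim(mem[ij[0]], mem[ij[1]]))
        del mem[j]  # drop the later member of the most redundant pair
    memory.clear()
    memory.extend(mem)
```

The quadratic pair scan in `diversity_policy` is acceptable here because the buffer size is capped by the memory bound.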

6 Conclusion and Future Work

Streaming cross document coreference has a number of compelling applications, especially concerning processing streams of data such as research publications, social media feeds, and news articles where new entities are frequently introduced. Despite being well-motivated, this task has received little attention from the NLP community. In order to foster a more welcoming environment for research on this task, we compile a diverse benchmark dataset for evaluating CDC, comprised of existing datasets that are free and publicly available. We additionally evaluate the performance of a collection of existing approaches for CDC, as well as introduce new approaches that leverage modern neural architectures. Our results highlight a number of challenges for future CDC research, such as how to better incorporate surface level features into neural mention encoders, as well as alternative policies for memory management that improve upon the naïve baselines studied in this work. Benchmark data and materials needed to reproduce our results are provided at: https://github.com/rloganiv/streaming-cdc.

[Figure 3: Effect of Bounded Memory. CoNLL F1 scores (y-axis) as the memory bound varies (x-axis: MAE, |E|, 2|E|), with one panel each for AIDA, MedMentions, and ZESHEL. Plotted systems: GRINCH (W), GRINCH (D), Greedy NN (W), Greedy NN (C), Greedy NN (D), Greedy NN (D-C). MAE: max. active entities (defined in Section 4.1). |E|: number of unique entities.]

Acknowledgements

The authors would like to thank Sanjay Subramanian, Nitish Gupta, Keith Hall, Ryan McDonald, Livio Baldini Soares, Nicholas FitzGerald, and Tom Kwiatkowski for their technical guidance and helpful comments while conducting this work. We would also like to thank the anonymous ACL reviewers for their valuable feedback. This project is supported in part by NSF award no. 1817183, and the DARPA MCS program under Contract No. N660011924033.


Broader Impact Statement

This paper focuses on systems that perform entity disambiguation without reliance on an external knowledge base. The potential benefit of such systems is an improved ability to track mentions of rare and emergent entities (e.g., natural disasters, novel disease variants, etc.); however, this capability is also relevant in digital surveillance settings, and may result in reduced privacy.

References

Prabal Agarwal, Jannik Strötgen, Luciano del Corro, Johannes Hoffart, and Gerhard Weikum. 2018. diaNED: Time-aware named entity disambiguation for diachronic corpora. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 686–693, Melbourne, Australia. Association for Computational Linguistics.

Emily Allaway, Shuai Wang, and Miguel Ballesteros. 2021. Sequential cross-document coreference resolution. arXiv preprint arXiv:2104.08413.

Amit Bagga and Breck Baldwin. 1998. Entity-based cross-document coreferencing using the vector space model. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 79–85, Montreal, Quebec, Canada. Association for Computational Linguistics.

Shany Barhom, Vered Shwartz, Alon Eirew, Michael Bugert, Nils Reimers, and Ido Dagan. 2019. Revisiting joint modeling of cross-document entity and event coreference resolution. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4179–4189, Florence, Italy. Association for Computational Linguistics.

Avi Caciularu, Arman Cohan, Iz Beltagy, Matthew E. Peters, Arie Cattan, and Ido Dagan. 2021. Cross-document language modeling. arXiv preprint arXiv:2101.00406.

Arie Cattan, Alon Eirew, Gabriel Stanovsky, Mandar Joshi, and Ido Dagan. 2020. Streamlining cross-document coreference resolution: Evaluation and modeling. arXiv preprint arXiv:2009.11032.

Agata Cybulska and Piek Vossen. 2014. Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 4545–4552, Reykjavik, Iceland. European Language Resources Association (ELRA).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Sourav Dutta and Gerhard Weikum. 2015. Cross-document co-reference resolution using sample-based clustering with knowledge enrichment. Transactions of the Association for Computational Linguistics, 3:15–28.

Michael Färber, Achim Rettinger, and Boulos El Asmar. 2016. On emerging entity detection. In Knowledge Engineering and Knowledge Management, pages 223–238, Cham. Springer International Publishing.

Chung Heong Gooi and James Allan. 2004. Cross-document coreference on a large scale corpus. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 9–16, Boston, Massachusetts, USA. Association for Computational Linguistics.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 782–792, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Ari Kobren, Nicholas Monath, Akshay Krishnamurthy, and Andrew McCallum. 2017. A hierarchical algorithm for extreme clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 255–264.

Nikolaos Kolitsas, Octavian-Eugen Ganea, and Thomas Hofmann. 2018. End-to-end neural entity linking. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 519–529, Brussels, Belgium. Association for Computational Linguistics.

Jonathan K. Kummerfeld and Dan Klein. 2013. Error-driven analysis of challenges in coreference resolution. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 265–277, Seattle, Washington, USA. Association for Computational Linguistics.

Jeffrey Ling, Nicholas FitzGerald, Zifei Shan, Livio Baldini Soares, Thibault Févry, David Weiss, and Tom Kwiatkowski. 2020. Learning cross-context entity representations from text. arXiv preprint arXiv:2001.03765.

Lajanugen Logeswaran, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Jacob Devlin, and Honglak Lee. 2019. Zero-shot entity linking by reading entity descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3449–3460, Florence, Italy. Association for Computational Linguistics.

Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 25–32, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Gideon Mann and David Yarowsky. 2003. Unsupervised personal name disambiguation. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 33–40.

Yehudit Meged, Avi Caciularu, Vered Shwartz, and Ido Dagan. 2020. Paraphrasing vs coreferring: Two sides of the same coin. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4897–4907, Online. Association for Computational Linguistics.

Rada Mihalcea and András Csomai. 2007. Wikify! Linking documents to encyclopedic knowledge. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 233–242.

Sunil Mohan and Donghui Li. 2019. MedMentions: A large biomedical corpus annotated with UMLS concepts. In Automated Knowledge Base Construction (AKBC).

Nicholas Monath, Ari Kobren, Akshay Krishnamurthy, Michael R. Glass, and Andrew McCallum. 2019. Scalable hierarchical clustering with tree grafting. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, pages 1438–1448, New York, NY, USA. Association for Computing Machinery.

Delip Rao, Paul McNamee, and Mark Dredze. 2010. Streaming cross document entity coreference resolution. In Coling 2010: Posters, pages 1050–1058, Beijing, China. Coling 2010 Organizing Committee.

Luke Shrimpton, Victor Lavrenko, and Miles Osborne. 2015. Sampling techniques for streaming cross document coreference resolution. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1391–1396, Denver, Colorado. Association for Computational Linguistics.

Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. 2011. Large-scale cross-document coreference using distributed inference and hierarchical models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 793–803, Portland, Oregon, USA. Association for Computational Linguistics.

Shubham Toshniwal, Sam Wiseman, Allyson Ettinger, Karen Livescu, and Kevin Gimpel. 2020. Learning to ignore: Long document coreference with bounded memory neural networks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8519–8526, Online. Association for Computational Linguistics.

Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In Sixth Message Understanding Conference (MUC-6): Proceedings of a Conference Held in Columbia, Maryland, November 6-8, 1995.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. 2020. Scalable zero-shot entity linking with dense entity retrieval. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6397–6407, Online. Association for Computational Linguistics.


A Error Analysis

A.1 Seen vs. Unseen Performance

We evaluate CoNLL F1 scores for mentions of entities that are seen vs. unseen in the AIDA and MedMentions training datasets (Zeshel is excluded since no test entities are seen in the training data). Results are provided in Table 4, with the performance of models that are trained using the in-domain training datasets reported in bold.

                      AIDA             MedMentions
                  Seen   Unseen      Seen   Unseen

Greedy NN
  Feature-Based    91.2    92.2       60.7    80.8
  MLM              67.2    71.7       54.5    61.5
  BLINK (Wiki)     39.0    39.2       45.1    59.0
  RELIC (Wiki)     89.3    89.9       57.6    62.2
  RELIC (In-Dom.)  87.1    86.4       72.6    80.3
  Hybrid           92.8    92.2       71.8    80.5

GRINCH
  MLM              46.2    43.0       30.6    24.9
  BLINK (Wiki)     39.0    39.2       38.1    33.7
  RELIC (Wiki)     88.2    89.1       58.4    62.8
  RELIC (In-Dom.)  84.2    70.3       73.1    78.1

Table 4: CoNLL F1 scores on mentions of entities that are seen vs. unseen in the AIDA and MedMentions training datasets. Zeshel is excluded since all entities in the test data are unseen. Bolded numbers indicate that the mention encoder is trained on seen mentions.

A.2 Clustering Mistakes

Kummerfeld and Klein (2013) define a system for categorizing coreference errors into a number of underlying error types. Because gold mention boundaries are provided in our task setup, the main error types of relevance are divided entities, i.e., mentions of the same entity that occur in different clusters, and conflated entities, i.e., mentions of different entities that are grouped into the same cluster. We can quantify these error types by counting the number of times clusters need to be merged together vs. split, respectively. The overall error counts are provided in Table 5.
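The merge/split counting above can be sketched as follows. This is a rough simplification of the Kummerfeld and Klein (2013) analysis, not their exact transformation-based procedure; inputs map each mention id to its gold entity and predicted cluster:

```python
from collections import defaultdict

def count_errors(gold, pred):
    """`divided` sums, over gold entities, the extra predicted clusters their
    mentions are spread across (merges needed); `conflated` sums, over
    predicted clusters, the extra gold entities they mix (splits needed)."""
    gold_to_pred, pred_to_gold = defaultdict(set), defaultdict(set)
    for mention, entity in gold.items():
        gold_to_pred[entity].add(pred[mention])
        pred_to_gold[pred[mention]].add(entity)
    divided = sum(len(s) - 1 for s in gold_to_pred.values())
    conflated = sum(len(s) - 1 for s in pred_to_gold.values())
    return conflated, divided
```

A perfect clustering yields (0, 0); over-merging inflates the first count and over-splitting the second.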

In addition to providing the overall error counts, we also render a sample of mentions from predicted clusters containing the most conflated entities in Tables 6–11.

                      AIDA          MedMent.         Zeshel
                  Confl.   Div.   Confl.   Div.   Confl.   Div.

Greedy NN
  Feature-Based     173     156    5.0K    5.2K    5.4K    6.2K
  MLM               764     676    8.9K    9.1K    6.0K    8.6K
  BLINK (Wiki)      1.6K    749   13.1K   12.3K    7.4K    5.9K
  RELIC (Wiki)      243     209    8.1K    8.4K    5.8K    6.5K
  RELIC (In-Dom.)   178     219    4.1K    4.1K    3.5K    7.8K
  Hybrid            155     156    4.4K    4.5K    4.5K    5.9K

GRINCH
  MLM               1.2K    829    8.4K    0       6.9K    5.2K
  BLINK (Wiki)      1.7K    749    7.9K    3.2K    8.0K    2.8K
  RELIC (Wiki)      284     214    7.8K    8.2K    6.8K    6.6K
  RELIC (In-Dom.)   101     794    3.3K    5.4K    2.9K    8.7K

Table 5: Clustering Mistakes. Conflated: Number of times mentions of different entities are grouped into the same cluster. Divided: Number of times mentions of the same entity are grouped into different clusters.


Feature-Based
  FIS Ski Jumping World Cup        . . . ) 228.1 ( 129.4 / 98.7 ) Leading World Cup standings ( after three events ) : 1. . . .
  FIFA World Cup                   . . . his squad to face Macedonia next week in a World Cup qualifier . Midfielder Valentin Stefan and striker Viorel . . .
  1966 FIFA World Cup              . . . 35 caps and was a key member of the 1966 World Cup winning team with his younger brother , Bobby . . .

MLM
  Sheffield Wednesday F.C.         . . . . 0-1 . 19,306 Liverpool 0 Sheffield Wednesday 1 ( Whittingham 22 ) . 0-1 . . . .
  Aston Villa F.C.                 . . . 0 Leeds 0 . 30,018 Southampton 0 Aston Villa 1 ( Townsend 34 ) . 0-1 . . . .
  Newcastle United F.C.            . . . Villa 17 9 3 5 22 15 30 Newcastle 15 9 2 4 26 17 29 Manchester . . .

BLINK (Wiki)
  Japan national football team     . . . SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT . . . .
  China PR national football team  . . . SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT . Nadim Ladki AL-AIN . . .
  Al Ain                           . . . CHINA IN SURPRISE DEFEAT . Nadim Ladki AL-AIN , United Arab Emirates 1996-12-06 Japan began the . . .

RELIC (Wiki)
  RKC Waalwijk                     . . . division soccer match played on Friday : RKC Waalwijk 1 ( Starbuck 76 ) Willem II Tilburg 2 . . .
  Willem II (football club)        . . . : RKC Waalwijk 1 ( Starbuck 76 ) Willem II Tilburg 2 ( Konterman 45 , Van der Vegt . . .
  PSV Eindhoven                    . . . SOCCER - PSV HIT VOLENDAM FOR SIX . AMSTERDAM 1996-12-07 . . .

RELIC (In-Domain)
  Japan national football team     . . . SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT . . . .
  China PR national football team  . . . SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT . Nadim Ladki AL-AIN . . .
  United Nations                   . . . 1975 and annexed the following year . The United Nations has never recognised Jakarta 's move . Alatas . . .

Hybrid
  FIFA World Cup                   . . . '' Japan , co-hosts of the World Cup in 2002 and ranked 20th in the world by . . .
  1966 FIFA World Cup              . . . 35 caps and was a key member of the 1966 World Cup winning team with his younger brother , Bobby . . .
  Rugby World Cup                  . . . Australia to defeat with a last-ditch drop-goal in the World Cup quarter-final in Cape Town . " Campo has . . .

Table 6: Most Conflated Entities on AIDA using Greedy NN Clustering. Left: Unique entity ID. Right: Mention with entity surface form in italics.

MLM
  Japan national football team     . . . SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT . . . .
  China PR national football team  . . . SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT . Nadim Ladki AL-AIN . . .
  Japan national football team     . . . Ladki AL-AIN , United Arab Emirates 1996-12-06 Japan began the defence of their Asian Cup title with . . .

BLINK (Wiki)
  Japan national football team     . . . SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT . . . .
  China PR national football team  . . . SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT . Nadim Ladki AL-AIN . . .
  Al Ain                           . . . CHINA IN SURPRISE DEFEAT . Nadim Ladki AL-AIN , United Arab Emirates 1996-12-06 Japan began the . . .

RELIC (Wiki)
  RKC Waalwijk                     . . . division soccer match played on Friday : RKC Waalwijk 1 ( Starbuck 76 ) Willem II Tilburg 2 . . .
  Willem II (football club)        . . . : RKC Waalwijk 1 ( Starbuck 76 ) Willem II Tilburg 2 ( Konterman 45 , Van der Vegt . . .
  PSV Eindhoven                    . . . SOCCER - PSV HIT VOLENDAM FOR SIX . AMSTERDAM 1996-12-07 . . .

RELIC (In-Domain)
  English people                   . . . Mahindra International . The Australian brushed aside unseeded Englishman Mark Cairns 15-7 15-6 15-8 . Top-seeded Eyles . . .
  England                          . . . 16-year-old who attended Sale Grammar School in the northern England city of Manchester died less than a day after . . .
  England                          . . . a last-minute goal to salvage a 2-2 draw for English premier league leaders Arsenal at home to Derby on . . .

Table 7: Most Conflated Entities on AIDA using GRINCH Clustering. Left: Unique entity ID. Right: Mention with entity surface form in italics.


Feature-Based
  Transcription, Genetic            . . . of eight tissues. The Mef2c promoter had the higher transcriptional activity in differentiated C2C12 cells than that in proliferating C2C12 . . .
  Transcriptional Activation        . . . in proliferating C2C12 cells, which was accompanied by the up-regulation of mRNA expression of Mef2c gene. Function deletion and . . .
  Regulation of Biological Process  . . . (GPCRs) is a key event for cell signaling and regulation of receptor function. Previously, using tandem mass spectrometry, we . . .

MLM
  Left Ventricular Function         . . . computed tomography angiography, we assessed 3 primary outcome measures: left ventricular (LV) systolic function (left ventricular ejection fraction), LV diastolic function (early relaxation . . .
  Left Ventricular Ejection Fraction . . . 3 primary outcome measures: left ventricular (LV) systolic function ( left ventricular ejection fraction ), LV diastolic function (early relaxation velocity), and coronary atherosclerosis . . .
  Endoscopy (Procedure)             . . . Comprehensive Cancer Network (NCCN) guidelines, which are based on rigid endoscopic measurements. The medical records of patients scheduled to receive . . .

BLINK (Wiki)
  Medical Records                   . . . guidelines, which are based on rigid endoscopic measurements. The medical records of patients scheduled to receive curative surgery for histologically . . .
  Lower - Spatial Qualifier         . . . with rectal cancer located in the upper (Ra) or lower (Rb) division using double-contrast barium enema. The median values . . .
  Asymptomatic (Finding)            . . . two bladder urothelial cancer metastatic to the penis with no relevant clinical symptoms . Namely, a 69 years-old man with a warthy lesions . . .

RELIC (Wiki)
  Lysosome-associated . . .         . . . Herein, we demonstrated that Zn(2+) could induce deglycosylation of lysosome-associated membrane protein 1 and 2 (LAMP-1 and LAMP-2), which primarily locate in . . .
  Sialic Acid . . .                 . . . In this study, we set out to define how CD169(+) phagocytes contribute to neuroinflammation in MS. CD169 - diphtheria . . .
  SMAD3 gene                        . . . Epigenome-wide analysis links SMAD3 methylation at birth to asthma in children of asthmatic . . .

RELIC (In-Domain)
  Individual                        . . . Cardiovascular Toxicity of Illicit Anabolic-Androgenic Steroid Use Millions of individuals have used illicit anabolic-androgenic steroids (AAS), but the long-term . . .
  Pharmacologic Substance           . . . steroids (AAS), but the long-term cardiovascular associations of these drugs remain incompletely understood. Using a cross-sectional cohort design, we . . .
  Finding                           . . . complications occurred in the studied neonates. Based on these findings , IC - ECG -guided tip placement appears to be . . .

Hybrid
  Study                             . . . marker for activated phagocytes in inflammatory disorders. In this study , we set out to define how CD169(+) phagocytes contribute . . .
  Evaluation                        . . . to provide holistic end-of-life care and assisted in the overall assessment of palliative care patients, identifying areas that might not . . .
  Research Study                    . . . which is hardly visible in clinically applied CT-imaging. This experimental study investigates ten different PSI designs and their effect to . . .

Table 8: Most Conflated Entities on MedMentions using Greedy NN Clustering. Left: Unique entity ID. Right: Mention with entity surface form in italics.

MLM
  Cardiovascular      . . . Cardiovascular Toxicity of Illicit Anabolic-Androgenic Steroid Use Millions of individuals . . .
  Toxic Effect        . . . Cardiovascular Toxicity of Illicit Anabolic-Androgenic Steroid Use Millions of individuals have . . .
  Steroids            . . . Cardiovascular Toxicity of Illicit Anabolic-Androgenic Steroid Use Millions of individuals have used illicit anabolic-androgenic steroids . . .

BLINK (Wiki)
  Cardiovascular      . . . Cardiovascular Toxicity of Illicit Anabolic-Androgenic Steroid Use Millions of individuals . . .
  Toxic Effect        . . . Cardiovascular Toxicity of Illicit Anabolic-Androgenic Steroid Use Millions of individuals have . . .
  Steroids            . . . Cardiovascular Toxicity of Illicit Anabolic-Androgenic Steroid Use Millions of individuals have used illicit anabolic-androgenic steroids . . .

RELIC (Wiki)
  Sialic Acid . . .   . . . In this study, we set out to define how CD169(+) phagocytes contribute to neuroinflammation in MS. CD169 - diphtheria . . .
  USP17L2 Protein     . . . DUB3 and USP7 de-ubiquitinating enzymes control replication inhibitor Geminin: molecular . . .
  USP7 Protein        . . . DUB3 and USP7 de-ubiquitinating enzymes control replication inhibitor Geminin: molecular characterization and . . .

RELIC (In-Domain)
  Protein Expression  . . . by vascular endothelial growth factor (VEGF) signaling. We describe spatiotemporal expression of vegf and vegfr and experimental manipulations targeting VEGF . . .
  Genes, Homeobox     . . . cell adhesion, and newly identified processes, including transcription and homeobox genes . We identified mutations in protein binding sites correlating with . . .
  Gene Expression     . . . identified mutations in protein binding sites correlating with differential expression of proximal genes and experimentally validated effects of mutations . . .

Table 9: Most Conflated Entities on MedMentions using GRINCH Clustering. Left: Unique entity ID. Right: Mention with entity surface form in italics.


Feature-Based

Yu - Gi - Oh ! - Episode 004 . . . Yu - Gi - Oh ! - Episode 004 ” Into the Hornet ’ s Nest ” , known . . .Yu - Gi - Oh ! ZEXAL - Episode 082 . . . Yu - Gi - Oh ! ZEXAL - Episode 082 ” Sphere Cube Calamity : Part 1 ” , known . . .Yu - Gi - Oh ! Duelist - Duel 168 . . . Yu - Gi - Oh ! Duelist - Duel 168 ” The Waiting Grave ” , also known as ” . . .

MLM

Yami Yugi and Rafael's first Duel . . . Yami Yugi and Rafael's first Duel Yami Yugi goes to duel against Rafael as the message . . .

Yu-Gi-Oh! - Episode 004 . . . Yu-Gi-Oh! - Episode 004 "Into the Hornet's Nest", known . . .

Yu-Gi-Oh! ZEXAL - Episode 082 . . . Yu-Gi-Oh! ZEXAL - Episode 082 "Sphere Cube Calamity: Part 1", known . . .

BLINK (Wiki)

Robin (Friends) . . . , Sarah and Maya. Emma has a horse called Robin, dog called Lady and a cat called Jewel. . . .

41003 Olivia's Newborn Foal . . . a pet bird, Goldie. Olivia also has a new pet foal, which she takes care of frequently. She seems . . .

41007 Heartlake Pet Salon . . . its neck. Background. Joanna brings her poodle to the pet salon, where Emma pampers her up. . . .

RELIC (Wiki)

Ro Gale . . . , as was Maquis leader Macias. Ro recalled that her father made the strongest "hasperat" she'd ever . . .

Unnamed shuttlepods (22nd century) . . . . " ( ) The Federation starship carried at least one shuttlepod until the time of its disappearance in the mid- . . .

Founders' homeworld (2372) . . . As she is reluctant to reveal the location of the Founders' new homeworld, but respects Sisko's loyalty to Odo when . . .

RELIC (In-Domain)

Yu-Gi-Oh! - Episode 004 . . . Yu-Gi-Oh! - Episode 004 "Into the Hornet's Nest", known . . .

Yu-Gi-Oh! ZEXAL - Episode 082 . . . Yu-Gi-Oh! ZEXAL - Episode 082 "Sphere Cube Calamity: Part 1", known . . .

Yu-Gi-Oh! Duelist - Duel 168 . . . Yu-Gi-Oh! Duelist - Duel 168 "The Waiting Grave", also known as " . . .

Hybrid

Yu-Gi-Oh! - Episode 004 . . . Yu-Gi-Oh! - Episode 004 "Into the Hornet's Nest", known . . .

Yu-Gi-Oh! ZEXAL - Episode 082 . . . Yu-Gi-Oh! ZEXAL - Episode 082 "Sphere Cube Calamity: Part 1", known . . .

Yu-Gi-Oh! Duelist - Duel 168 . . . Yu-Gi-Oh! Duelist - Duel 168 "The Waiting Grave", also known as " . . .

Table 10: Most Conflated Entities on Zeshel using Greedy NN Clustering. Left: Unique entity ID. Right: Mention with entity surface form in italics.

MLM

Moondeep Sea . . . Larynda Telenna was the high priestess of Kiaransalee in the Vault of Gnashing Teeth beneath Vaasa. She was also the leader of Kiaransalee . . .

Tabaxi (tribe) . . . the Chultan Peninsula, consisting primarily of members of the Tabaxi tribe. Description. Chultans were tall and had dark, . . .

New Velar . . . from the Moonsea Ride, as it would have connected Harrowdale Town with this major road. To avoid ambushes by the . . .

BLINK (Wiki)

Astral projection . . . also possible to escape with "teleportation" spells or astral travel, though the force blocked ethereal travel. A captive . . .

Krakentua (Shinkintin) . . . and force newly hatched krakentua spawn to fight. A krakentua related these events via dreams to adventurers in the Fochu . . .

Generic temple guard . . . two to attempt a crossing were Father Sambar and a temple guard. Sambar died horrifically, but the guard survived as . . .

RELIC (Wiki)

Moondeep Sea . . . Larynda Telenna was the high priestess of Kiaransalee in the Vault of Gnashing Teeth beneath Vaasa. She was also the leader of Kiaransalee . . .

Tabaxi (tribe) . . . the Chultan Peninsula, consisting primarily of members of the Tabaxi tribe. Description. Chultans were tall and had dark, . . .

Kaedlaw Burdun . . . the Silver Wyrm in 1369 DR, Queen Brianna, her newborn child, Avner, Tavis Burdun, and Basil retreated to . . .

RELIC (In-Domain)

Vetrix Family . . . Vetrix Family The Vetrix Family, known as the Tron Family in . . .

Yu-Gi-Oh! - Episode 004 . . . Yu-Gi-Oh! - Episode 004 "Into the Hornet's Nest", known . . .

Yu-Gi-Oh! ZEXAL - Episode 082 . . . Yu-Gi-Oh! ZEXAL - Episode 082 "Sphere Cube Calamity: Part 1", known . . .

Table 11: Most Conflated Entities on Zeshel using GRINCH Clustering. Left: Unique entity ID. Right: Mention with entity surface form in italics.