40
Proceedings of the Joint Conference on EMNLP and CoNLL: Shared Task, pages 1–40, Jeju Island, Korea, July 13, 2012. c 2012 Association for Computational Linguistics CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes Sameer Pradhan Raytheon BBN Technologies, Cambridge, MA 02138 USA [email protected] Alessandro Moschitti University of Trento, 38123 Povo (TN) Italy [email protected] Nianwen Xue Brandeis University, Waltham, MA 02453 USA [email protected] Olga Uryupina University of Trento, 38123 Povo (TN) Italy [email protected] Yuchen Zhang Brandeis University, Waltham, MA 02453 USA [email protected] Abstract The CoNLL-2012 shared task involved pre- dicting coreference in English, Chinese, and Arabic, using the final version, v5.0, of the OntoNotes corpus. It was a follow-on to the English-only task organized in 2011. Un- til the creation of the OntoNotes corpus, re- sources in this sub-field of language process- ing were limited to noun phrase coreference, often on a restricted set of entities, such as the ACE entities. OntoNotes provides a large- scale corpus of general anaphoric coreference not restricted to noun phrases or to a spec- ified set of entity types, and covers multi- ple languages. OntoNotes also provides ad- ditional layers of integrated annotation, cap- turing additional shallow semantic structure. This paper describes the OntoNotes annota- tion (coreference and other layers) and then describes the parameters of the shared task in- cluding the format, pre-processing informa- tion, evaluation criteria, and presents and dis- cusses the results achieved by the participat- ing systems. The task of coreference has had a complex evaluation history. Potentially many evaluation conditions, have, in the past, made it difficult to judge the improvement in new algorithms over previously reported re- sults. Having a standard test set and stan- dard evaluation parameters, all based on a re- source that provides multiple integrated anno- tation layers (syntactic parses, semantic roles, word senses, named entities and coreference) and in multiple languages could support joint modeling and help ground and energize on- going research in the task of entity and event coreference. 1 Introduction The importance of coreference resolution for the entity/event detection task, namely identifying all mentions of entities and events in text and clustering them into equivalence classes, has been well recog- nized in the natural language processing community. Early work on corpus-based coreference resolu- tion dates back to the mid-90s by McCarthy and Lenhert (1995) where they experimented with deci- sion trees and hand-written rules. Corpora to support supervised learning of this task date back to the Mes- sage Understanding Conferences (MUC) (Hirschman and Chinchor, 1997; Chinchor, 2001; Chinchor and Sundheim, 2003). The de facto standard datasets for current coreference studies are the MUC and the ACE 1 (Doddington et al., 2004) corpora. These cor- pora were tagged with coreferring entities in the form of noun phrases in the text. The MUC corpora cover all noun phrases in text but are relatively small in size. The ACE corpora, on the other hand, cover much more data, but the annotation is restricted to a small subset of entities. Automatic identification of coreferring entities and events in text has been an uphill battle for sev- eral decades, partly because it is a problem that re- quires world knowledge to solve and word knowl- edge is hard to define, and partly owing to the lack of substantial annotated data. Aside from the fact that resolving coreference in text is simply a very hard problem, there have been other hindrances that further contributed to the slow progress in this area: (i) Smaller sized corpora such as MUC which cov- ered coreference across all noun phrases. Cor- pora such as ACE which are larger in size, but cover a smaller set of entities; and (ii) low consistency in existing corpora annotated with coreference — in terms of inter-annotator agreement (ITA) (Hirschman et al., 1998) — owing to attempts at covering multiple coref- erence phenomena that are not equally anno- tatable with high agreement which likely less- ened the reliability of statistical evidence in the form of lexical coverage and semantic related- ness that could be derived from the data and 1 http://projects.ldc.upenn.edu/ace/data/ 1

CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

Proceedings of the Joint Conference on EMNLP and CoNLL: Shared Task, pages 1–40,Jeju Island, Korea, July 13, 2012. c©2012 Association for Computational Linguistics

CoNLL-2012 Shared Task:Modeling Multilingual Unrestricted Coreference in OntoNotes

Sameer PradhanRaytheon BBN Technologies,

Cambridge, MA 02138USA

[email protected]

Alessandro MoschittiUniversity of Trento,

38123 Povo (TN)Italy

[email protected]

Nianwen XueBrandeis University,Waltham, MA 02453

[email protected]

Olga UryupinaUniversity of Trento,

38123 Povo (TN)Italy

[email protected]

Yuchen ZhangBrandeis University,Waltham, MA 02453

[email protected]

Abstract

The CoNLL-2012 shared task involved pre-dicting coreference in English, Chinese, andArabic, using the final version, v5.0, of theOntoNotes corpus. It was a follow-on to theEnglish-only task organized in 2011. Un-til the creation of the OntoNotes corpus, re-sources in this sub-field of language process-ing were limited to noun phrase coreference,often on a restricted set of entities, such asthe ACE entities. OntoNotes provides a large-scale corpus of general anaphoric coreferencenot restricted to noun phrases or to a spec-ified set of entity types, and covers multi-ple languages. OntoNotes also provides ad-ditional layers of integrated annotation, cap-turing additional shallow semantic structure.This paper describes the OntoNotes annota-tion (coreference and other layers) and thendescribes the parameters of the shared task in-cluding the format, pre-processing informa-tion, evaluation criteria, and presents and dis-cusses the results achieved by the participat-ing systems. The task of coreference hashad a complex evaluation history. Potentiallymany evaluation conditions, have, in the past,made it difficult to judge the improvement innew algorithms over previously reported re-sults. Having a standard test set and stan-dard evaluation parameters, all based on a re-source that provides multiple integrated anno-tation layers (syntactic parses, semantic roles,word senses, named entities and coreference)and in multiple languages could support jointmodeling and help ground and energize on-going research in the task of entity and eventcoreference.

1 IntroductionThe importance of coreference resolution for the

entity/event detection task, namely identifying allmentions of entities and events in text and clusteringthem into equivalence classes, has been well recog-nized in the natural language processing community.

Early work on corpus-based coreference resolu-tion dates back to the mid-90s by McCarthy andLenhert (1995) where they experimented with deci-sion trees and hand-written rules. Corpora to supportsupervised learning of this task date back to the Mes-sage Understanding Conferences (MUC) (Hirschmanand Chinchor, 1997; Chinchor, 2001; Chinchor andSundheim, 2003). The de facto standard datasetsfor current coreference studies are the MUC and theACE1 (Doddington et al., 2004) corpora. These cor-pora were tagged with coreferring entities in theform of noun phrases in the text. The MUC corporacover all noun phrases in text but are relatively smallin size. The ACE corpora, on the other hand, covermuch more data, but the annotation is restricted to asmall subset of entities.

Automatic identification of coreferring entitiesand events in text has been an uphill battle for sev-eral decades, partly because it is a problem that re-quires world knowledge to solve and word knowl-edge is hard to define, and partly owing to the lackof substantial annotated data. Aside from the factthat resolving coreference in text is simply a veryhard problem, there have been other hindrances thatfurther contributed to the slow progress in this area:

(i) Smaller sized corpora such as MUC which cov-ered coreference across all noun phrases. Cor-pora such as ACE which are larger in size, butcover a smaller set of entities; and

(ii) low consistency in existing corpora annotatedwith coreference — in terms of inter-annotatoragreement (ITA) (Hirschman et al., 1998) —owing to attempts at covering multiple coref-erence phenomena that are not equally anno-tatable with high agreement which likely less-ened the reliability of statistical evidence in theform of lexical coverage and semantic related-ness that could be derived from the data and

1http://projects.ldc.upenn.edu/ace/data/

1

Page 2: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

used by a classifier to generate better predic-tive models. The importance of a well-definedtagging scheme and consistent ITA has beenwell recognized and studied in the past (Poe-sio, 2004; Poesio and Artstein, 2005; Passon-neau, 2004). There is a growing consensus thatin order to take language understanding appli-cations such as question answering or distilla-tion to the next level, we need more consistentannotation for larger amounts of broad cover-age data to train better automatic models forentity and event detection.

(iii) Complex evaluation with multiple evaluationmetrics and multiple evaluation scenarios,complicated with varying training and testpartitions, led to situations where many re-searchers report results with only one or a fewof the available metrics and under a subset ofevaluation scenarios. This has made it hard togauge the improvement in algorithms over theyears (Stoyanov et al., 2009), or to determinewhich particular areas require further attention.Looking at various numbers reported in litera-ture can greatly affect the perceived difficultyof the task. It can seem to be a very hard prob-lem (Soon et al., 2001) or one that is relativelyeasy (Culotta et al., 2007).

(iv) the knowledge bottleneck which has been awell-accepted ceiling that has kept the progressin this task at bay.

These issues suggest that the following stepsmight take the community in the right direction to-wards improving the state of the art in coreferenceresolution:

(i) Create a large corpus with high inter-annotator agreement possibly by restrictingthe coreference annotating to phenomena thatcan be annotated with high consistency, andcovering an unrestricted set of entities andevents; and

(ii) Create a standard evaluation scenario with anofficial evaluation setup, and possibly severalablation settings to capture the range of perfor-mance. This can then be used as a standardbenchmark by the research community.

(iii) Continue to improve learning algorithms thatbetter incorporate world knowledge and jointlyincorporate information from other layers ofsyntactic and semantic annotation to improvethe state of the art.

One of the many goals of the OntoNotesproject2 (Hovy et al., 2006; Weischedel et al., 2011)

2http://www.bbn.com/nlp/ontonotes

was to explore whether it could fill this void and helppush the progress further — not only in coreference,but with the various layers of semantics that it triesto capture. As one of its layers, it has created acorpus for general anaphoric coreference that cov-ers entities and events not limited to noun phrasesor a subset of entity types. The coreference layerin OntoNotes constitutes just one part of a multi-layered, integrated annotation of shallow semanticstructures in text with high inter-annotator agree-ment. This addresses the first issue.

In the language processing community, the fieldof speech recognition probably has the longest his-tory of shared evaluations held primary by NIST3

(Pallett, 2002). In the past decade machine trans-lation has been a topic of shared evaluations alsoby NIST4. There are many syntactic and semanticprocessing tasks that are not quite amenable to suchcontinued evaluation efforts. The CoNLL sharedtasks over the past 15 years have filled that gap, help-ing establish benchmarks and advance the state ofthe art in various sub-fields within NLP. The impor-tance of shared tasks is now in full display in thedomain of clinical NLP (Chapman et al., 2011) andrecently a coreference task was organized as partof the i2b2 workshop (Uzuner et al., 2012). Thecomputational learning community is also witness-ing a shift towards joint inference based evaluations,with the two previous CoNLL tasks (Surdeanu et al.,2008; Hajic et al., 2009) devoted to joint learning ofsyntactic and semantic dependencies. A SemEval-2010 coreference task (Recasens et al., 2010) wasthe first attempt to address the second issue. Itincluded six different Indo-European languages —Catalan, Dutch, English, German, Italian, and Span-ish. Among other corpora, a small subset (∼120K)of English portion of OntoNotes was used for thispurpose. However, the lack of a strong participa-tion prevented the organizers from reaching any firmconclusions. The CoNLL-2011 shared task was an-other attempt to address the second issue. It was wellreceived, but the shared task was only limited to theEnglish portion of OntoNotes. In addition, the coref-erence portion of OntoNotes did not have a concretebaseline prior to the 2011 evaluation, thereby mak-ing it challenging for participants to gauge the per-formance of their algorithms in the absence of es-tablished state of the art on this flavor of annotation.The closest comparison was to the results reportedby Pradhan et al. (2007b) on the newswire portion ofOntoNotes. Since the corpus also covers two otherlanguages from completely different language fami-lies, Chinese and Arabic, it provided a great oppor-tunity to have a follow-on task in 2012 covering all

3http://www.itl.nist.gov/iad/mig/publications/ASRhistory/index.html4http://www.itl.nist.gov/iad/mig/tests/mt/

2

Page 3: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

three languages. As we will see later, peculiaritiesof each of these languages had to be considered increating the evaluation framework.

The first systematic learning-based study in coref-erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al.(2001). Significant improvements have been madein the field of language processing in general, andimproved learning techniques have pushed the stateof the art in coreference resolution forward (Mor-ton, 2000; Harabagiu et al., 2001; McCallum andWellner, 2004; Culotta et al., 2007; Denis andBaldridge, 2007; Rahman and Ng, 2009; Haghighiand Klein, 2010). Researchers have continued tofind novel ways of exploiting ontologies such asWordNet. Various knowledge sources from shallowsemantics to encyclopedic knowledge have been ex-ploited (Ponzetto and Strube, 2005; Ponzetto andStrube, 2006; Versley, 2007; Ng, 2007). Giventhat WordNet is a static ontology and as such haslimitation on coverage, more recently, there havebeen successful attempts to utilize information frommuch larger, collaboratively built resources such asWikipedia (Ponzetto and Strube, 2006). More re-cently researchers have used graph based algorithms(Cai et al., 2011a) rather than pair-wise classifica-tions. For a detailed survey of the progress in thisfield, we refer the reader to a recent article (Ng,2010) and a tutorial (Ponzetto and Poesio, 2009)dedicated to this subject. In spite of all the progress,current techniques still rely primarily on surfacelevel features such as string match, proximity, andedit distance; syntactic features such as apposition;and shallow semantic features such as number, gen-der, named entities, semantic class, Hobbs’ distance,etc. Further research to reduce the knowledge gap isessential to take coreference resolution techniques tothe next level.

The rest of the paper is organized as follows: Sec-tion 2 presents an overview of the OntoNotes cor-pus. Section 3 describes the range of phenomenaannotated in OntoNotes, and language-specific is-sues. Section 4 describes the shared task data andthe evaluation parameters, with Section 4.4.2 exam-ining the performance of the state-of-the-art toolson all/most intermediate layers of annotation. Sec-tion 5 describes the participants in the task. Sec-tion 6 briefly compares the approaches taken by var-ious participating systems. Section 7 presents thesystem results with some analysis. Section 8 com-pares the performance of the systems on the a subsetof the Engish test set that corresponds with the testset used for the CoNLL-2011 evaluation. Section 9draws some conclusions.

2 The OntoNotes CorpusThe OntoNotes project has created a large-scale

corpus of accurate and integrated annotation of mul-tiple levels of the shallow semantic structure in text.The English and Chinese language portion com-prises roughly one million words per language ofnewswire, magazine articles, broadcast news, broad-cast conversations, web data and conversationalspeech data. The English subcorpus also containsan additional 200K words of the English translationof the New Testament as Pivot Text. The Arabic por-tion is smaller, comprising 300K words of newswirearticles. The hope is that this rich, integrated an-notation covering many layers will allow for richer,cross-layer models and enable significantly betterautomatic semantic analysis. In addition to coref-erence, this data is also tagged with syntactic trees,propositions for most verb and some noun instances,partial verb and noun word senses, and 18 named en-tity types. Manual annotation of a large corpus withmultiple layers of syntax and semantic informationis a costly endeavor. Over the years in the devel-opment of this corpus, there were various prioritiesthat came into play, and therefore not all the data inthe corpus could be annotated with all the differentlayers of annotation. However, such multi-layer an-notations, with complex, cross-layer dependencies,demands a robust, efficient, scalable storage mech-anism while providing efficient, convenient, inte-grated access to the the underlying structure. Tothis effect, it uses a relational database representa-tion that captures both the inter- and intra-layer de-pendencies and also provides an object-oriented APIfor efficient, multi-tiered access to this data (Prad-han et al., 2007a). This facilitates the extraction ofcross-layer features in integrated predictive modelsthat will make use of these annotations.

OntoNotes comprises the following layers of an-notation:

• Syntax — A layer of syntactic annotation forEnglish, Chinese and Arabic based on a revisedguidelines for the Penn Treebank (Marcus etal., 1993; Babko-Malaya et al., 2006), the Chi-nese Treebank (Xue et al., 2005) and the ArabicTreebank (Maamouri and Bies, 2004).• Propositions — The proposition structure of

verbs based on revised guidelines for the En-glish PropBank (Palmer et al., 2005; Babko-Malaya et al., 2006), the Chinese PropBank(Xue and Palmer, 2009) and the Arabic Prop-Bank (Palmer et al., 2008; Zaghouani et al.,2010).• Word Sense — Coarse-grained word senses

are tagged for the most frequent polysemousverbs and nouns, in order to maximize token

3

Page 4: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

coverage. The word sense granularity is tai-lored to achieve 90% inter-annotator agreementas demonstrated by Palmer et al. (2007). Thesesenses are defined in the sense inventory files.In case of English and Arabic languages, thesense-inventories (and frame files) are definedseparately for each part of speech that is real-ized by the lemma in the text. For Chinese,however the sense inventories (and frame files)are defined per lemma — independent of thepart of speech realized in the text. For theEnglish portion of OntoNotes, each individualsense has been connected to multiple WordNetsenses. This provides users direct access to theWordNet semantic structure. There is also amapping from the OntoNotes word senses toPropBank frames and to VerbNet (Kipper etal., 2000) and FrameNet (Fillmore et al., 2003).Unfortunately, owing to lack of comparable re-sources as comprehensive as WordNet in Chi-nese or Arabic, neither language has any inter-resource mappings available.• Named Entities — The corpus was tagged

with a set of 18 well-defined proper named en-tity types that have been tested extensively forinter-annotator agreement by Weischedel andBurnstein (2005).• Coreference — This layer captures general

anaphoric coreference that covers entities andevents not limited to noun phrases or a lim-ited set of entity types (Pradhan et al., 2007b).It considers all pronouns (PRP, PRP$), nounphrases (NP) and heads of verb phrases (VP)as potential mentions. Unlike English, Chineseand Arabic have dropped subjects and objectswhich were also considered during coreferenceannotation5. We will take a look at this in detailin the next section.

3 Coreference in OntoNotesGeneral anaphoric coreference that spans a rich

set of entities and events — not restricted to a fewtypes, as has been characteristic of most coreferencedata available until now — has been tagged with ahigh degree of consistency in the OntoNotes corpus.Two different types of coreference are distinguished:Identity (IDENT), and Appositive (APPOS). Identitycoreference (IDENT) is used for anaphoric corefer-ence, meaning links between pronominal, nominal,and named mentions of specific referents. It does notinclude mentions of generic, underspecified, or ab-stract entities. Appositives (APPOS) are treated sep-arately because they function as attributions, as de-scribed further below. Coreference is annotated forall specific entities and events. There is no limit on

5As we will see later these are not used during the task.

the semantic types of NP entities that can be consid-ered for coreference, and in particular, coreferenceis not limited to ACE types. The guidelines are fairlylanguage independent. We will look at some salientaspects of the coreference annotation in OntoNotes.For more details, and examples, we refer the readerto the release documentation. We will primarily useEnglish examples to describe various aspects of theannotation and use Chinese and Arabic examples es-pecially to illustrate phenomena not observed in En-glish, or that have some language specific peculiari-ties.3.1 Noun Phrases

The mentions over which IDENT coreference ap-plies are typically pronominal, named, or definitenominal. The annotation process begins by automat-ically extracting all of the NP mentions from parsetrees in the syntactic layer of OntoNotes annotation,though the annotators can also add additional men-tions when appropriate. In the following two exam-ples (and later ones), the phrases in bold form thelinks of an IDENT chain.

(1) She had a good suggestion and it was unani-mously accepted by all.

(2) Elco Industries Inc. said it expects net incomein the year ending June 30, 1990, to fall below arecent analyst’s estimate of $ 1.65 a share. TheRockford, Ill. maker of fasteners also said itexpects to post sales in the current fiscal yearthat are “slightly above” fiscal 1989 sales of $155 million.

Noun phrases (NPs) in Chinese can be complexnoun phrases or bare nouns (nouns that lack a de-terminer such as “the” or “this”). Complex nounphrases contain structures modifying the head noun,as in the following examples:

(3) (他担任总统任内最后一次 的 (亚太经济合作会议 (高峰会))).((His last APEC (summit meeting)) as thePresident)

(4) (越南 统一 后 (第一 位 前往 当地 访问 的(美国总统)))((The first (U.S. president)) who went to visitVietnam after its unification)

In these examples, the smallest phrase in paren-theses is the bare noun. The longer phrase in paran-theses includes modifying structures. All the expres-sions in the parantheses, however, share the samehead noun, i.e., “高峰会 (summit meeting)”, and“美国总统 (U.S. president)” respectively. Nestednoun phrases, or nested NPs, are contained within

4

Page 5: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

longer noun phrases. In the above example, “sum-mit meeting” and “U.S. president” are nested NPs.Wherever NPs are nested, the largest logical span isused in coreference.

3.2 VerbsVerbs are added as single-word spans if they can

be coreferenced with a noun phrase or with anotherverb. The intent is to annotate the VP, but the single-word verb head is marked for convenience. Thisincludes morphologically related nominalizations asin (5) and noun phrases that refer to the same event,even if they are lexically distinct from the verb as in(6). In the following two examples, only the chainsrelated to the growth event are shown in bold. TheArabic translation of the same example identifiesmentions using parantheses.

(5) The European economy grew rapidly over thepast years, this growth helped raising ....�

H@ñJ�Ë@ ÈC

g

�é«Qå��. ú

G.ðPð

B@ XA�

�J�¯B

@ ( AÖ

ß ) Y

�®Ë

. . . ©P ú

¯ ÑëA� ( ñÒ

JË @ @

Yë ) ,

�éJ

�AÖÏ @

(6) Japan’s domestic sales of cars, trucks and busesin October rose 18% from a year earlier to500,004 units, a record for the month, the JapanAutomobile Dealers’ Association said. Thestrong growth followed year-to-year increasesof 21% in August and 12% in September.

3.3 PronounsAll pronouns and demonstratives are linked to

anything that they refer to, and pronouns in quotedspeech are also marked. Expletive or pleonastic pro-nouns (it, there) are not considered for tagging, andgeneric you is not marked. In the following exam-ple, the pronoun you and it would not be marked. (Inthis and following examples, an asterisk (*) before aboldface phrase identifies entity/event mentions thatwould not be tagged in the coreference annotation.)

(7) Senate majority leader Bill Frist likes to tella story from his days as a pioneering heartsurgeon back in Tennessee. A lot of times,Frist recalls, *you’d have a critical patient ly-ing there waiting for a new heart, and *you’dwant to cut, but *you couldn’t start unless *youknew that the replacement heart would make*it to the operating room.

In Chinese, all the following pronouns — 你,我,他, 她, 它, 你们,我们,他们,它们,我, 您, 咱们 (you, me, he, she, and so on), anddemonstrative pronouns —这个,那个,这些,那些 (this, that, these, those) in singular, plural or pos-sessive forms are linked to anything they refer to.

Pronouns from classical Chinese such as 其 中(among which),其 (he/she/it),之 (he/she/it) are alsolinked with other mentions to which they refer.

In Arabic, the following pronouns are corefer-enced – nominative personal pronouns (subject) anddemonstrative pronouns which are detached. Sub-ject pronouns are often null in Arabic; overt subjectpronouns are rare, but do occur.

á��K @ / Õ

�æ

K @ / AÒ

�JK @ / ám�

' / áë / Ñë / AÒë

(We, you, they)

ùë / ñë /

�I

K@ / A

K @

(I, you, she, he)

Object pronouns are attached to the verb (directobjects) or preposition (indirect objects)

Aë / é� / ¼ / ø(Me, you, him, her)

á�ë / Ñ

�ë / AÒ

�» / á

�» / Õ

�» / A

K

(Us, you, them)

and, possessive (adjectival) pronouns are identicalto object pronouns, but are attached to nouns.

Aë / é� / ¼ / ø(My, your, his, her)

á�ë / Ñ

�ë / AÒ

�» / á

�» / Õ

�» / A

K

(Our, your, their)

Pronouns such as你,您,你们,大家,各位can be considered generic. In this case, they are notlinked to other generic mentions in the same dis-course. For example,

(8) 请 *大大大家家家带好自己的随身物品。*大家请下车。Please take your belongings with *you. Pleaseget off the train, *everyone.

In Chinese, if the subject or object can be recov-ered from the context, or if it is of little interest forthe reader/listener to know, it can be omitted. Inthe Chinese Treebank, a small *pro* in inserted inpositions where the subject or object is omitted. A*pro* can be replaced by overt NPs if they refer tothe same entity or event, and the *pro* and its overtNP antecedent do not have to be in the same sen-tence. Exactly what *pro* stands for is determinedby the linguistic context in which it appears.

(9) 吉林省主管经贸工作的副省长全哲洙说:“(*pro*) 欢迎国际社会同(我我我们们们) 一一一道道道,,,共共共同同同推进图门江开发事业,促进区域经济发展,造福东北亚人民。Quan Zhezhu, Vice Governor of JinlinProvince who is in charge of economics andtrade, said: “(*pro*) Welcome internationalsocieties to join (us) in the development of Tu-men Jiang, so as to promote regional economicdevelopment and benefit people in NortheastAsia.

5

Page 6: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

Sometimes, *pro*s cannot be recovered in thetext—i.e., an overt NP cannot be identified as theirantecedent in the same text — and therefore they arenot linked. For instance, the *pro* in existential sen-tences usually cannot be recovered or linked in theannotation, as in the following example:

(10) (*pro*) 有二十三顶高新技术项目进区开发。There are 23 high-tech projects under develop-ment in the zone.

Also, if *pro* does not refer to a specific entity orevent, it is considered generic *pro* and not linkedas in (11).

(11) 肯德基 、 麦当劳 等 速食店 全 大陆 都 推出了 (*pro*)买套餐赠送布质或棉质圣诞老人玩具的促销.In Mainland China, fast food restaurants suchas Kentucky Fried Chicken and McDonald’shave launched their promotional packages byproviding free cotton Santa toys for eachcombo (*pro*) purchased.

Finally, *pro*s in idiomatic expressions are notlinked. Similar to Chinese, Arabic null subjects andobjects are also eligible for coreference and treatedsimilarly. In the Arabic Treebank, these are markedwith just an “*”. There exists few of these instancesin English — marked (yet differently) with a *PRO*in the treebank and which are connected in Prop-Bank annotation but not in coreference.3.4 Generic mentions

Generic nominal mentions can be linked with re-ferring pronouns and other definite mentions, but notwith other generic nominal mentions.

This would allow linking of the bolded mentionsin (12) and (13), but not in (14).

(12) Officials said they are tired of making thesame statements.

(13) Meetings are most productive when they areheld in the morning. Those meetings, how-ever, generally have the worst attendance.

(14) Allergan Inc. said it received approval tosell the PhacoFlex intraocular lens, the firstfoldable silicone lens available for *cataractsurgery. The lens’ foldability enables it to beinserted in smaller incisions than are now pos-sible for *cataract surgery.

Bare plurals, as in (12) and (13), are always con-sidered generic. In example (15) below, there arethree generic instances of parents. These are markedas distinct IDENT chains (with separate chains dis-tinguished by subscripts X, Y and Z), each contain-ing a generic and the related referring pronouns.

(15) ParentsX should be involved with theirXchildren’s education at home, not in school.TheyX should see to it that theirX kids don’tplay truant; theyX should make certain thatthe children spend enough time doing home-work; theyX should scrutinize the report card.ParentsY are too likely to blame schools for theeducational limitations of theirY children. IfparentsZ are dissatisfied with a school, theyZshould have the option of switching to another.

In (16) below, the verb “halve” cannot be linked to“a reduction of 50%”, since “a reduction” is indefi-nite.

(16) Argentina said it will ask creditor banks to*halve its foreign debt of $64 billion — thethird-highest in the developing world . Ar-gentina aspires to reach *a reduction of 50%in the value of its external debt.

3.5 Pre-modifiersProper pre-modifiers can be coreferenced, but

proper nouns that are in a morphologically adjectivalform are treated as adjectives, and are not corefer-enced. For example, adjectival forms of GPEs suchas Chinese in “the Chinese leader”, would not belinked. Thus we could coreference United States in“the United States policy” with another referent, butnot American in “the American policy.” GPEs andNationality acronyms (e.g. U.S.S.R. or U.S.). arealso considered adjectival. Pre-modifier acronymscan be coreferenced unless they refer to a national-ity. Thus in the examples below, FBI can be corefer-enced to other mentions, but U.S. cannot.

(17) FBI spokesman

(18) *U.S. spokesman

In Chinese adjectival and nominal forms of GPEsare not morphologically distinct, and in such casesthe annotator decides whether it is an adjectival us-age. Usually if something is tagged as NORP then itis not considered as a mention.

Dates and monetary amounts can be consideredpart of a coreference chain even when they occur aspre-modifiers.

(19) The current account deficit on France’s balanceof payments narrowed to 1.48 billion Frenchfrancs ($236.8 million) in August from a re-vised 2.1 billion francs in July, the FinanceMinistry said. Previously, the July figure wasestimated at a deficit of 613 million francs.

(20) The company’s $150 offer was unexpected.The firm balked at the price.

6

Page 7: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

3.6 Copular verbsAttributes signaled by copular structures are not

marked; these are attributes of the referent they mod-ify, and their relationship to that referent will be cap-tured through word sense and proposition annota-tion.

(21) JohnX is a linguist. PeopleY are nervousaround JohnX, because heX always correctstheirY grammar.

Copular (or ’linking’) verbs are those verbs thatfunction as a copula and are followed by a subjectcomplement. Some common copular verbs are: be,appear, feel, look, seem, remain, stay, become, endup, get. Subject complements following such verbsare considered attributes and are not linked. SinceCalled is copular, neither IDENT nor APPOS corefer-ence is marked in the following case.

(22) Called Otto’s Original Oat Bran Beer, the brewcosts about $12.75 a case.

Some examples of copular verbs in Chinese are是 (to be) and 为 (to be, to serve as). In addition,other verbs (particularly so-called light verbs) thattrigger an attributive reading on the following NP:成为 (become), (当)选为 (is elected),称为 (is called),(好)像 (looks like),叫做 (is called), etc.

(23) (上海海海)是是是*(中中中国国国最最最大大大的的的城城城市市市)。。。(上上上海海海)发发发展展展得得得很很很快快快。。。(Shanghai) is *(the largest city in China).(Shanghai) develops fast.

In the above example, the two mentions of 上上上海海海 (Shanghai) co-refer with each other, but the en-tity does not co-refer with 中中中国国国最最最大大大的的的城城城 市市市 (thelargest city in China).

3.7 Small clausesLike copulas, small clause constructions are not

marked as coreferent. The following example istreated as if the copula were present (“John consid-ers Fred to be an idiot”):

(24) John considers *Fred *an idiot.

Note that the mention Fred, however, can be con-nected to other mentions of Fred in the text.

3.8 Temporal expressionsTemporal expressions such as the following are

linked:

(25) John spent three years in jail. In that time...

Deictic expressions such as now, then, today, tomor-row, yesterday, etc. can be linked, as well as othertemporal expressions that are relative to the time ofthe writing of the article, and which may thereforerequire knowledge of the time of the writing to re-solve the coreference. Annotators were allowed touse knowledge from outside the text in resolvingthese cases. In the following example, the end ofthis period and that time can be coreferenced, as canthis period and from three years to seven years.(26) The limit could range from three years to

seven yearsX, depending on the compositionof the management team and the nature of itsstrategic plan. At (the end of (this period)X)Y,the poison pill would be eliminated automati-cally, unless a new poison pill were approvedby the then-current shareholders, who wouldhave an opportunity to evaluate the corpora-tion’s strategy and management team at thattimeY.

In multi-date temporal expressions, embeddeddates are not separately connected to other mentionsof that date. For example in Nov. 2, 1999, Nov.would not be linked to another instance of Novemberlater in the text.3.9 Appositives

Because they logically represent attributions, ap-positives are tagged separately from Identity coref-erence. They consist of a head, or referent (a nounphrase that points to a specific object/concept in theworld), and one or more attributes of that referent.An appositive construction contains a noun phrasethat modifies an immediately-adjacent noun phrase(separated only by a comma, colon, dash, or paren-thesis). It often serves to rename or further definethe first mention. Marking appositive constructionsallows capturing the attributed property even thoughthere is no explicit copula.

(27) Johnhead, a linguistattribute

The head of each appositive construction is distin-guished from the attribute according to the followingheuristic specificity scale, in a decreasing order fromtop to bottom:

Type Example

Proper noun JohnPronoun HeDefinite NP the manIndefinite specific NP a man I knowNon-specific NP man

This leads to the following cases:

(28) Johnhead, a linguistattribute

(29) A famous linguistattribute, hehead studied at ...

7

Page 8: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

Type Description

Annotator Error An annotator error. This is a catch-all category for cases of errors that do not fit in the othercategories.

Genuine Ambiguity This is just genuinely ambiguous. Often the case with pronouns that have no clear an-tecedent (especially this & that)

Generics One person thought this was a generic mention, and the other person didn’tGuidelines The guidelines need to be clear about this exampleCallisto Layout Something to do with the usage/design of CallistoReferents Each annotator thought this was referring to two completely different thingsPossessives One person did not mark this possessiveVerb One person did not mark this verbPre Modifiers One person did not mark this Pre ModifierAppositive One person did not mark this appositiveCopula Disagreement arose because this mention is part of a copular structure

a) Either each annotator marked a different half of the copulab) Or one annotator unnecessarily marked both

Figure 1: Description of various disagreement types.

Figure 1: The distribution of disagreements across the various types in Table 2

Sheet1

Page 1

Copulae 2%Appositives 3%Pre Modifiers 3%Verbs 3%Possessives 4%Referents 7%Callisto Layout 8%Guidelines 8%Generics 11%Genuine Ambiguity 25%Annotator Error 26%

Copulae

Appositives

Pre Modifiers

Verbs

Possessives

Referents

Callisto Layout

Guidelines

Generics

Genuine Ambiguity

Annotator Error

0% 5% 10% 15% 20% 25% 30%

Figure 2: The distribution of disagreements across the various types in Table 1 for a sample of 15K disagreements inthe English portion of the corpus.

(30) a principal of the firmattribute, J. Smithhead

In cases where the two members of the appositiveare equivalent in specificity, the left-most member ofthe appositive is marked as the head/referent. Defi-nite NPs include NPs with a definite marker (the) aswell as NPs with a possessive adjective (his). Thusthe first element is the head in all of the followingcases:

(31) The chairman, the man who never gives up

(32) The sheriff, his friend

(33) His friend, the sheriff

In the specificity scale, specific names of diseasesand technologies are classified as proper names,whether they are capitalized or not.

(34) A dangerous bacteria, bacillium, is found

When the entity to which an appositive refers isalso mentioned elsewhere, only the single span con-taining the entire appositive construction is includedin the larger IDENT chain. None of the nested NP

spans are linked. In the example below, the en-tire span can be linked to later mentions to RichardGodown.

The sub-spans are not included separately in theIDENT chain.

(35) Richard Godown, president of the Indus-trial Biotechnology Association

Ages are tagged as attributes (as if they were el-lipses of, for example, a 42-year-old):

(36) Mr.Smithhead, 42attribute,

Similar rules apply for Chinese and Arabic. Un-like English, where most appositives have a punctu-ation marker, in Chinese that is not necessarily thefrequent case. In the following example we can seean appositive construction without any punctuationsbetween the head and the attribute.

(37) 上上上图图图 左左左 起起起 ::: (无无无锡锡锡市市市市市市长长长)X[attribute]

(王王王宏宏宏民民民)X[head],,, (副副副市市市长长长)Y[attribute]

(洪洪洪锦锦锦、、、张张张怀怀怀西西西)Y[head],,, ...

8

Page 9: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

Language Genre A1-A2 A1-ADJ A2-ADJ

English Newswire [NW] 80.9 85.2 88.3Broadcast News [BN] 78.6 83.5 89.4Broadcast Conversation [BC] 86.7 91.6 93.7Magazine [MZ] 78.4 83.2 88.8Weblogs and Newsgroups [WB] 85.9 92.2 91.2Telephone Conversation [TC] 81.3 94.1 84.7Pivot Text [PT] (New Testament) 89.4 96.0 92.0

Chinese Newswire [NW] 73.6 84.8 75.1Broadcast News [BN] 80.5 86.4 91.6Broadcast Conversation [BC] 84.1 90.7 91.2Magazine [MZ] 74.9 81.2 80.0Weblogs and Newsgroups [WB] 87.6 92.3 93.5Telephone Conversation [TC] 65.6 86.6 77.1

Table 1: Inter Annotator (A1 and A2) and Adjudicator(ADJ) agreement for the Coreference Layer in OntoNotesmeasured in terms of the MUC score.

Figure above from left : WuxiMayorX[attribute] Wang HongminX[head],Deputy MayorsY[attribute] Hong Jin, ZhangHuaixiY[head], ...

3.10 Special IssuesIn addition to the ones above, there are some spe-

cial cases such as:

• No coreference is marked between an organi-zation and its members.• GPEs are linked to references to their govern-

ments, even when the references are nestedNPs, or the modifier and head of a single NP.• In extremely rare cases, metonymic mentions

can be co-referenced. This is done only whenthe two mentions clearly and without a doubtrefer to the same entity. For example:

(38) In a statement released this afternoon, 10Downing Street called the bombings inCasablanca “a strike against all peace-loving people.”

(39) In a statement, Britain called theCasablanca bombings “a strike against allpeace-loving people.”

In this case, it is obvious that “10 Down-ing Street” and “Britain” are being used inter-changeably in the text. Again, if there is anyambiguity, however, these terms are not coref-erenced with each other.• In Arabic, verbal inflections are not considered

pronominal and are not coreferenced. The por-tion marked with an * in the example below isan inflection and not a pronoun, and so shouldnot be marked.

(40) �éJk. PA

mÌ'@

�èP@ Pð Õæ�

�@ H.

�é�®£A

JË @ ( �

H* ) hQå�

�I��Ë ( Aë )

à@

�@ : É

¯ñ

�J� CJ�

K @X

�éKQå��ñ�Ë@

­J

Jk. ú

¯ B ð

àQK. ú

¯

The Swiss foreign ministry’sspokeswoman announced the (she) isneither in Burne nor in Geneva Pronounsin quoted speech are also marked.

3.11 Annotator Agreement and AnalysisTable 1 shows the inter-annotator and annotator-

adjudicator agreement on all the genres and lan-guages of OntoNotes. A 15K disagreements in var-ious parts of the English data was analyzed, andgrouped into one of the categories shown in Figure1. Figure 2 shows the distribution of these differ-ent types that were found in that sample. It can beseen that genuine ambiguity and annotator error arethe biggest contributors — the latter of which is usu-ally captured during adjudication, thus showing theincreased agreement between the adjudicated ver-sion and the individual annotator version. Interest-ingly, this mirrors the annotator disagreement analy-sis on the MUC corpus provided by Hirschman et al.(1998).

4 CoNLL-2012 Coreference TaskThe CoNLL-2012 shared task was held across all

three languages — English, Chinese and Arabic —of the OntoNotes v5.0 data. The task was to auto-matically identify mentions of entities and events intext and to link the coreferring mentions together toform entity/event chains. The coreference decisionshad to be made using automatically predicted infor-mation on other structural and semantic layers in-cluding the parses, semantic roles, word senses, andnamed entities. Given various factors, such as thelack of resources and state-of-the-art tools, and timeconstraints, we could not provide some layers of in-formation for the Chinese and Arabic portion of thedata.

The three languages are from quite different lan-guage families. The morphology of these languagesis quite different. Arabic has a complex morphol-ogy, English has limited morphology, whereas Chi-nese has very little morphology. English word seg-mentation amounts to rule-based tokenization, andis close to perfect. In the case of Chinese and Ara-bic, although the tokenization/segmentation is notas good as English, the accuracies are in the high90s. Syntactically, there are many dropped subjectsand objects in Arabic and Chinese, whereas Englishis not a pro-drop language. Another difference isthe amount of resources available for each language.English has probably the most resources at its dis-posal, whereas Chinese and Arabic lack significantly

9

Page 10: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

— Arabic more so than Chinese. Given this fact,plus the fact that the CoNLL format cannot handlemultiple segmentations, and that it would compli-cate scoring since we are using exact token bound-aries (as discussed later in Section 4.5), we decidedto allow the use of gold, treebank segmentation forall languages. In the case of Chinese, the wordsthemselves are lemmas, so no additional informa-tion needs to be provided. For Arabic, by defaultwritten text is unvocalised, so we decided to alsoprovide correct, gold standard lemmas, along withthe correct vocalized version of the tokens. Table 2lists which layers were available and quality of theprovided layers (when provided.)

Layer English Chinese Arabic

Segmentation • • •Lemma

√— •

Parse√ √ √6

Proposition√ √ ×

Predicate Frame√ × ×

Word Sense√ √ √

Name Entities√ × ×

Speaker • • —

Table 2: Summary of predicted layers provided for eachlanguage. A “•” indicates gold annotation, a “

√” indi-

cates predicted, a “×” indicates an absence of the pre-dicted layer, and a “—” indicates that the layer is not ap-plicable to the language.

As is customary for CoNLL tasks, there were twoprimary tracks — closed and open. For the closedtrack, systems were limited to using the distributedresources, in order to allow a fair comparison of al-gorithm performance, while the open track allowedfor almost unrestricted use of external resources inaddition to the provided data. Within each closedand open track, we had an optional supplementarytrack which allowed us to run some ablation studiesover a few different input conditions. This allowedus to evaluate the systems given: i) Gold mentionboundaries (GB), ii) Gold mentions (GM), and iii)Gold parses (GS). We will refer to the main task –where no mention boundaries are provided – as NB.

4.1 Primary EvaluationThe primary evaluation comprises the closed and

open tracks where predicted information is providedon all layers of the test set other than coreference. Asmentioned earlier, we provide gold lemma and vo-calization information for Arabic, and we use goldstandard treebank segmentation for all three lan-guages.

6The predicted part of speech for Arabic are a mapped downversion of the richer gold version present in the treebank

4.1.1 Closed TrackIn the closed track, systems were limited to the

provided data. For the training and test data, inaddition to the underlying text, predicted versionsof all the supplementary layers of annotation wereprovided using off-the-shelf tools (parsers, semanticrole labelers, named entity taggers, etc.) retrained onthe training portion of the OntoNotes data — as de-scribed in Section 4.4.2. For the training data, how-ever, in addition to predicted values for the other lay-ers, we also provided manual, gold-standard anno-tations for all the layers. Participants were allowedto use either the gold-standard or predicted annota-tion to train their systems. They were also free touse the gold-standard data to train their own modelsfor the various layers of annotation, if they judgedthat those would either provide more accurate pre-dictions or alternative predictions for use as multipleviews, or if they wished to use a lattice of predic-tions.

More so than previous CoNLL shared tasks,coreference predictions depend on world knowl-edge, and many state-of-the-art systems use infor-mation from external resources such as WordNet,which provides a layer of information that couldhelp a system recognize semantic connections be-tween the various lexicalized mentions in the text.Therefore, in the case of English, similar to the pre-vious year’s task, we allowed the use of WordNet inthe closed track. Since word senses in OntoNotesare predominantly7 coarse-grained groupings ofWordNet senses, systems could also map from thepredicted or gold-standard word senses to the setsof underlying WordNet senses. Another significantpiece of knowledge that is particularly useful forcoreference but that is not available in the layers ofOntoNotes is that of number and gender. There aremany different ways of predicting these values, withdiffering accuracies, so in order to ensure that par-ticipants in the closed track were working from thesame data, thus allowing clearer algorithmic com-parisons, we specified a particular table of numberand gender predictions generated by Bergsma andLin (2006), for use during both training and test-ing. Unfortunately neither Arabic, nor Chinese havecomparable resources available that we could allowparticipants to use. Chinese, in particular, does nothave number or gender inflections for nouns, but(Baran and Xue, 2011) look at a way to infer suchinformation.

4.1.2 Open TrackIn addition to resources available in the closed

track, in the open track, systems were allowed to use

7There are a few instances of novel senses introduced inOntoNotes which were not present in WordNet, and so lack amapping back to the WordNet senses

10

Page 11: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

Algorithm 1 Procedure used to create OntoNotes training, developmentand test partitions.Procedure: Generate Partitions(OntoNotes) returns Train, Dev, Test

1: Train ← ∅2: Dev ← ∅3: Test ← ∅4: for all Source ∈ OntoNotes do5: if Source = Wall Street Journal then6: Train ← Train ∪ Sections 02 – 217: Dev ← Dev ∪ Sections 00, 01, 22, 248: Test ← Test ∪ Section 239: else10: if Number of files in Source ≥ 10 then11: Train ← Train ∪ File IDs ending in 1 – 812: Dev ← Dev ∪ File IDs ending in 013: Test ← Test ∪ File IDs ending in 914: else15: Dev ← Dev ∪ File IDs ending in 016: Test ← Test ∪ File ID ending in the highest number17: Train ← Train ∪ Remaining File IDs for the Source18: end if19: end if20: end for21: return Train, Dev, Test

1

external resources such as Wikipedia, gazetteers etc.The purpose of this track is mainly to get an ideaof the performance ceiling on the task at the cost ofnot being able to perform a fair comparison acrossall systems. Another advantage of the open track isthat it might reduce the barriers to participation byallowing participants to field existing research sys-tems that already depend on external resources —especially if there were hard dependencies on theseresources — so they can participate in the task withminimal, or no modification to their existing system.

4.2 Supplementary EvaluationIn addition to the option of selecting between the

primary closed or the open tracks, the participantsalso had an option to run their systems in the follow-ing ablation settings.

Gold Mention Boundaries (GB) In this case, weprovided all possible correct mention boundaries inthe test data. This essentially entails all NPs, andPRPs in the data extracted from the gold parse trees,as well as the mentions that do not align with anyparse constituent, for example, non-existent con-stituents in the predicted parse owing to errors, somenamed entities, etc.

Gold Mentions (GM) In this dataset, we providedonly and all the correct mentions for the test sets,thereby reducing the task to one of pure mentionclustering, and eliminating the task of mention de-

tection and anaphoricity determination8. These alsoinclude potential spans that do not align with anyconstituent in the predicted parse tree.

Gold Parses (GS) In this case, for each language,we replaced the predicted parses in the closed trackdata with manual, gold parses.

4.3 Train, Development and Test SplitsFor various reasons, not all the documents in

OntoNotes have been annotated with all the differentlayers of annotation, with full coverage.9 There is acore portion, however, which is roughly 1.6M En-glish words, 950K Chinese words, and 300K Arabicwords which has been annotated with all the layers.This is the portion that we used for the shared task.

We used the same algorithm as in CoNLL-2011 to8Mention detection interacts with anaphoricity determina-

tion since the corpus does not contain any singleton mentions.9As mentioned earlier, large scale manual annotation of var-

ious layers of syntax and semantics is an expensive endeavor.Adding to this, the fact that word sense annotation is most ef-ficiently done one lemma at a time, ideally all instances of thesame across the entire corpus, or as large a portion as possi-ble, full coverage across all lemma instances is hard to achievegiven the long tail of low frequency lemmas with a Zipfian dis-tribution. Similar issue affects PropBank annotation, but fur-thermore, currently it only covers mostly verb predicates, and afew eventive noun predicates.

10http://projects.ldc.upenn.edu/ace/data/11These numbers are for the part of OntoNotes v5.0 that have

all layers of annotation including coreferenced.

11

Page 12: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

Corpora Language Words DocumentsTotal Train Dev Test Total Train Dev Test

MUC-6 English 25K 12K 13K 60 30 30MUC-7 English 40K 19K 21K 67 30 37ACE10(2000-2004) English 960K 745K 215K - - -

Chinese 615K 455K 150K - - -Arabic 500K 350K 150K - - -

OntoNotes11 English 1.6M 1.3M 160K 170K 2,384(3493)

1,940(2,802)

222(343)

222(348)

Chinese 950K 750K 110K 90K 1,729(2,280)

1,391(1,810)

172(252)

166(218)

Arabic 300K 240K 30K 30K 447(447)

359(359)

44(44)

44(44)

Table 3: Number of documents in the OntoNotes v5.0 data, and some comparison with the MUC and ACE data sets. Thenumbers in parenthesis for the OntoNotes corpus indicate the total number of parts that correspond to the documents.Each part was considered a separate document for evaluation purposes.

create the train/development/test partitions for En-glish, Chinese and Arabic. We tried to reuse pre-viously established partitions for Chinese and Ara-bic, but either they were not in the selection used forOntoNotes, or were partially overlapping, or had avery small portion of OntoNotes covered in the testset. Unfortunately, unlike English WSJ partitions,there was no clean way of reusing those partitions.Algorithm 1 details this procedure. The list of train-ing/development/test document IDs can be found onthe task webpage12. Following the recent CoNLLtradition, participants were allowed to use both thetraining and the development data to train their finalmodel(s).

The number of documents in the corpus for thistask, for each of the different languages, and foreach of the training/development/test portions, areshown in Table 3. For comparison purposes, it alsolists the number of documents in the MUC-6, MUC-7, and ACE (2000-2004) corpora. The MUC-6 datawas taken from the Wall Street Journal, whereas theMUC-7 data was from the New York Times. The ACEdata spanned many different languages and genressimilar to the ones in OntoNotes. In fact, there issome overlap between ACE and OntoNotes sourcedocuments.

4.4 Data PreparationThis section gives details of the different annota-

tion layers including the automatic models that wereused to predict them, and describes the formats inwhich the data was provided to the participants.

12http://conll.cemantix.org/2012/download/ids/

For each language there are two sub-directories— “all” contains more general lists which includedocuments that had at least one of the layers of an-notation, and “coref” contains the lists that includedocument that have coreference annotation. The formerwere used to generate training/development/test setsfor layers other than coreference, and the latter wasused to generate training/development/test sets for thecoreference layer used in this shared task.

4.4.1 Manual Annotation Gold LayersLet us take a look at the manually annotated, or

gold layers of information that were made availablefor the training data.

Coreference The manual coreference annotationis stored as chains of linked mentions connectingmultiple mentions of the same entity. Coreference isthe only document-level phenomenon in OntoNotes,and the complexity of annotation increases non-linearly with the length of a document. Unfortu-nately, some of the documents — especially theones in the broadcast conversation, weblogs, andtelephone conversation genre — are very long andthat prohibited efficient annotation in their entirety.These had to be split into smaller parts. A few passesto join some adjacent parts were conducted, butsince some documents had as many as 17 parts, thereare still multi-part documents in the corpus. Sincethe coreference chains are coherent only within eachof these document parts, for the purpose of this task,each such part is treated as a separate document. An-other thing to note is that there were some casesof sub-token annotation in the corpus owing to thefact that tokens were not split at hyphens. Casessuch as pro-WalMart had the sub-span WalMart linkedwith another instance of WalMart. The recent Tree-bank revision split tokens at most hyphens and madea majority of these sub-token annotations go away.There were still some residual sub-token annota-tions. Since subtoken annotations cannot be repre-sented in the CoNLL format, and they were a verysmall quantity — much less than even half a per-cent — we decided to ignore them. Unlike English,Chinese and Arabic have coreference annotation onelided subjects/objects. Recovering these entities intext is a hard problem, and the most recently re-ported numbers in literature for Chinese are arounda F-score of 50 (Yang and Xue, 2010; Cai et al.,2011b). For Arabic there have not been much stud-ies on recovering these. A study by Gabbard (2010)shows that these can be recovered with an F-score

12

Page 13: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

of 55 with automatic parses and roughly 65 usinggold parses13. Considering the level of predictionaccuracy of these tokens, and the relative frequencyof the same, plus the fact that the CoNLL tabularformat is not amenable to a variable number of to-kens, we decided not to consider them as part ofthe task. In other words, we removed the manuallyidentified traces (*pro* and *) respectively in Chi-nese and Arabic Treebanks. We also do not considerthe links that are formed by these tokens in the goldevaluation key.

Tables 4 and 5 shows the distribution of mentionsby the syntactic categories, and the counts of enti-ties, links and mentions in the corpus respectively.Interestingly the mentions formed by these droppedpronouns total roughly about 11% for both Chineseand Arabic. All of this data has been Treebankedand PropBanked either as part of the OntoNotes ef-fort, or some previous effort.

Language Syntactic Train Development Testcategory Count % Count % Count %

English Noun Phrase 61.8K 39.46 9.7K 45.57 9.2K 42.97Pronoun 66.7K 42.61 7.8K 36.66 8.2K 38.69Proper Noun 18.1K 11.60 2.2K 10.66 2.3K 10.96Dropped Pro. - - - - - -Other Noun 2.636 1.68 546 2.55 500 2.33Verb 2.522 1.61 299 1.40 342 1.60Other 4.761 3.04 676 3.16 738 3.45

Chinese Noun Phrase 40.7K 34.23 5.4K 32.53 5.1K 35.31Pronoun 20.8K 17.50 3.3K 19.88 2.5K 17.65Dropped Pro. 13.5K 11.39 1.9K 12.04 1.5K 10.71Proper Noun 19.0K 15.96 2.8K 17.24 2.2K 15.54Other Noun 23.6K 19.88 2.8K 17.08 2.8K 19.71Verb 244 0.20 51 0.31 20 0.14Other 994 0.83 153 0.92 139 0.95

Arabic Noun Phrase 10.8K 34.93 1.3K 35.02 1.3K 36.51Pronoun 8.9K 28.77 1.0K 28.33 1.1K 30.58Dropped Pro. 3.5K 11.52 477 12.57 429 11.78Proper Noun 4.0K 13.01 450 11.86 390 10.71Other Noun 3.3K 10.90 439 11.57 345 9.47Verb 25 0.08 4 0.11 0 0.00Other 247 0.79 21 0.55 35 0.96

Table 4: Distribution of mentions in the data by their syn-tactic category.

Parse Trees These represent the syntactic layerthat is a revised version of the treebanks in English,Chinese and Arabic. Arabic treebank has probablyseen the most revision over the past few years, in aneffort to increase consistency. For purposes of thistask, traces were removed from the syntactic trees,since the CoNLL-style data format, being indexedby tokens, does not provide any good means of con-veying that information. As mentioned in the previ-ous section, these include the cases of traces in Chi-nese and Arabic which are dropped subjects/objects

13These numbers are not in the thesis, but we received themin an email communication with the Ryan Gabbard.

Language Type Train Development Test All

English Entities/Chains 35,143 4,546 4,532 44,221Links 120,417 14,610 15,232 150,259Mentions 155,560 19,156 19,764 194,480

Chinese Entities/Chains 28,257 3,875 3,559 35,691Links 74,597 10,308 9,242 94,147Mentions 102,854 14,183 12,801 129,838

Arabic Entities/Chains 8,330 936 980 10,246Links 19,260 2,381 2,255 23,896Mentions 27,590 3,313 3,235 34,138

Table 5: Number of entities, links and mentions in theOntoNotes v5.0 data.

that are legitimate targets for coreference annota-tion. Function tags were also removed, since theparsers that we used for the predicted syntax layerdid not provide them. One thing that needs to bedealt with in conversational data is the presence ofdisfluencies (restarts, etc.). In the English parsesof the OntoNotes, the disfluencies are marked usinga special EDITED14 phrase tag — as was the casefor the Switchboard Treebank. Given the frequencyof disfluencies and the performance with which onecan identify them automatically,15 a probable pro-cessing pipeline would filter them out before pars-ing. Since we did not have a readily available tag-ger for tagging disfluencies, we decided to removethem using oracle information available in the En-glish Treebank, and the coreference chains wereremapped to trees without disfluencies. Owing tovarious constraints, we decided to retain the disflu-encies in the Chinese data. Since Arabic portion ofthe corpus is all newswire, this had no impact onit. However, for both Chinese and Arabic, since weremove trace tokens corresponding to dropped pro-nouns, all the other layers of annotation had to beremapped to the remaining sequence of tree tokens.

Propositions The propositions in OntoNotes arePropBank-style semantic roles for English, Chineseand Arabic. Most of the verb predicates in the cor-pus have been annotated with their arguments. Aspart of the OntoNotes effort, some enhancementswere made to the English PropBank and Treebankto make them synchronize better with each other(Babko-Malaya et al., 2006). One of the outcomesof this effort was that two types of LINKs that rep-resent pragmatic coreference (LINK-PCR) and selec-

14There is another phrase type — EMBED in the telephoneconversation genre which is similar to the EDITED phrase type,and sometimes identifies insertions, but sometimes containslogical continuation of phrases by different speakers, so we de-cided not to remove that from the data.

15A study by Charniak and Johnson (2001) shows that onecan identify and remove edits from transcribed conversationalspeech with an F-score of about 78, with roughly 95 Precisionand 67 recall.

13

Page 14: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

tional preferences (LINK-SLC) were added to Prop-Bank. More details can be found in the addendum tothe PropBank guidelines16 in the OntoNotes v5.0 re-lease. Since the community is not used to this repre-sentation which relies heavily on the trace structurein the Treebank which we are excluding, we decidedto unfold the LINKs back to their original represen-tation as in the PropBank 1.0 release. This function-ality is part of the OntoNotes DB Tool.17

Word Sense Gold standard word sense annotationwas supplied using sense numbers (along with thesense inventories) as specified in the OntoNotes listof senses for each lemma. The coverage of the wordsense annotation varies among the languages. En-glish has the most coverage, while coverage for Chi-nese and Arabic is more sporadic. Even for English,the coverage for word sense annotation is not com-plete. Only some of the verbs and nouns are anno-tated with word sense information.

Named Entities Named Entities in OntoNotesdata are specified using a catalog of 18 Name types.

Other Layers Discourse plays a vital role incoreference resolution. In the case of broadcast con-versation, or telephone conversation data, it partiallymanifests itself in the form of speakers of a given ut-terance, whereas in weblogs or newsgroups it doesso as the writer, or commenter of a particular articleor thread. This information provides an importantclue for correctly linking anaphoric pronouns withthe right antecedents. This information could be au-tomatically deduced, but since it would add addi-tional complexity to the already complex task, wedecided to provide oracle information of this meta-data both during training and testing. In other words,speaker and author identification was not treated asan annotation layer that needed to be predicted. Thisinformation was provided in the form of another col-umn in the .conll file. There were some cases ofinterruptions and interjections that led to a sentenceassociated with two different speakers, but since thefrequency of this was quite small, we decided tomake an assumption of one speaker/writer per sen-tence.

4.4.2 Predicted Annotation LayersThe predicted annotation layers were derived us-

ing automatic models trained using cross-validationon other portions of OntoNotes v5.0 data. Asmentioned earlier, there are some portions of theOntoNotes corpus that have not been annotated forcoreference but that have been annotated for otherlayers. For training models for each of the layers,where feasible, we used all the data that we could

16doc/propbank/english-propbank.pdf17http://cemantix.org/ontonotes.html

Layer English Chinese ArabicVerb Noun All Verb Noun

Sense Inventories 2702 2194 763 150 111Frames 5672 1335 20134 2743 532

Table 7: Number of senses defined for English, Chineseand Arabic in the OntoNotes v5.0 corpus.

for that layer from the training portion of the entireOntoNotes v5.0 release.

Parse Trees Predicted parse trees for English wereproduced using the Charniak parser18 (Charniak andJohnson, 2005). Some additional tag types usedin the OntoNotes trees were added to the parser’stagset, including the NML tag that has recently beenadded to capture internal NP structure, and the rulesused to determine head words were extended corre-spondingly. Chinese and Arabic parses were gen-erated using the Berkeley parser (Petrov and Klein,2007). In the case of Arabic, the parsing commu-nity uses a mapping from rich Arabic part of speechtags, to Penn-style part of speech tags. We used themapping that is included with the Arabic treebank.

The predicted parses for the training portion ofthe data were generated using 10-fold (5-fold forArabic) cross-validation. The development and testparses were generated using a model trained on theentire training portion. We used OntoNotes v5.0training data for training the Chinese and Arabicparser models, but the OntoNotes v4.0 subset ofOntoNotes v5.0 data was used for training the En-glish model. We decided to do the latter to be able tobetter compare the scores to the CoNLL-2011 eval-uation given that parser is a central component to acoreference system, and the fact that OntoNotes v5.0adds a small fraction of gold parses on top of thoseprovided by OntoNotes v4.0. Table 6 shows the per-formance of the re-trained parsers on the CoNLL-2012 test set. We did not get a chance to re-train there-ranker available for English, and since the stockre-ranker crashes when run on n-best parses contain-ing NMLs, because it has not seen that tag in train-ing, we could not make use of it. In addition to theparser scores and part of speech accuracy, we havealso added a column for the accuracy for the NPsbecause they are particularly relevant to the corefer-ence task.

18http://bllip.cs.brown.edu/download/reranking-parserAug06.tar.gz

19There was an error in processing the test set, therefore theperformance on the test set was slightly lower than the correctone reported in the table. The performance of the sense taggingthe offical test set is 77.6 (R), 71.5 (P) and 74.4 (F).

14

Page 15: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

All Sentences Sentence length < 40N POS NP R P F N R P F

English Broadcast Conversation [BC] 2,194 95.93 90.05 84.30 84.46 84.38 2,124 85.83 85.97 85.90Broadcast News [BN] 1,344 96.50 91.11 84.19 84.28 84.24 1,278 85.93 86.04 85.98Magazine [MZ] 780 95.14 91.63 87.11 87.46 87.28 736 87.71 88.04 87.87Newswire [NW] 2,273 96.95 90.14 87.05 87.45 87.25 2,082 88.95 89.27 89.11Telephone Conversation [TC 1,366 93.52 88.96 79.73 80.83 80.28 1,359 79.88 80.98 80.43Weblogs and Newsgroups [WB] 1,658 94.67 89.16 83.32 83.20 83.26 1,566 85.14 85.07 85.11Pivot Text [PT] (New Testament) 1,217 96.87 95.39 92.48 93.66 93.07 1,217 92.48 93.66 93.07

Overall 9,615 96.03 90.78 85.25 85.43 85.34 9145 86.86 87.02 86.94

Chinese Broadcast Conversation [BC] 885 94.79 86.32 79.35 80.17 79.76 824 80.92 81.86 81.38Broadcast News [BN] 929 93.85 86.00 80.13 83.49 81.78 756 81.82 84.65 83.21Magazine [MZ] 451 97.06 92.40 83.85 88.48 86.10 326 85.64 89.80 87.67Newswire [NW] 481 94.07 79.70 77.28 82.26 79.69 406 79.06 83.84 81.38Telephone Conversation [TC] 968 92.22 80.15 69.19 71.90 70.52 942 69.59 72.24 70.89Weblogs and Newsgroups [WB] 758 92.37 85.60 78.92 82.57 80.70 725 79.30 83.10 81.16

Overall 4,472 94.12 85.74 78.93 82.23 80.55 3,979 79.80 82.79 81.27

Arabic Newswire [NW] 1,003 94.12 80.70 75.67 74.71 75.19 766 77.44 74.99 76.19

Table 6: Parser performance on the CoNLL-2012 test set.

AccuracyR P F

English Broadcast Conversation [BC] 81.3 81.2 81.2Broadcast News [BN] 81.5 82.0 81.7Magazine [MZ] 78.8 79.1 79.0Newswire [NW] 85.7 85.7 85.7Weblogs and Newsgroups [WB] 77.6 77.5 77.5

Overall 82.5 82.5 82.5

Chinese Broadcast Conversation [BC] - - 80.5Broadcast News [BN] - - 85.4Magazine [MZ] - - 82.4Newswire [NW] - - 89.1

Overall - - 84.3

Arabic Newswire [NW]19 75.2 75.9 75.6

Table 8: Word sense performance over both verbs and nouns in the CoNLL-2012 test set.

Word Sense This year we used the IMS (It MakesSense) (Zhong and Ng, 2010) word sense tagger.20

Word sense information, unlike syntactic parse in-formation is not central to approaches taken by cur-rent coreference systems and so we decided to usea better word sense tagger to get a good state of theart accuracy estimate, at the cost of a completely fair(but, still close enough) comparison with EnglishCoNLL-2011 results. This will also allow potentialfuture uses to benefit from it. IMS was trained onall the word sense data that is present in the train-ing portion of the OntoNotes corpus using cross-validated predictions on the input layers similar tothe proposition tagger. During testing, for Englishand Arabic, IMS must first uses the automatic POSinformation to identify the nouns and verbs in thetest data, and then assign senses to the automatically

20We offer special thanks to Hwee Tou Ng and his studentZhi Zhong for training IMS models and providing output forthe development and test sets.

identified nouns and verbs. In case of Arabic, IMSuses gold lemmas. Since automatic POS tagging isnot perfect, IMS does not always output a sense toall word tokens that need to be sense tagged due towrongly predicted POS tags. As such, recall is notthe same as precision on the English and Arabic testdata. Recall that in Chinese, the word senses aredefined against lemmas and are independent of thepart of speech. Since we provide gold word seg-mentation, IMS attempts to sense tag all correctlysegmented Chinese words, so recall and precisionare same and so is F1. Table 7 gives the number oflemmas covered by the word sense inventory in theEnglish, Chinese and Arabic portion of OntoNotes.

Table 8 shows the performance of this classifieraggregated over both the verbs and nouns in theCoNLL-2012 test set. For English, genres PT andTC, and for Chinese genres TC and WB, no gold stan-dard senses were available, and so their accuraciescould not be computed.

15

Page 16: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

Propositions We used ASSERT21 (Pradhan et al.,2005) to predict the propositional structure for En-glish. Similar to the parser model for English,the same proposition model that was used in theCoNLL-2011 shared task — trained on all the train-ing portion of the OntoNotes v4.0 data using cross-validated predicted parses — was used to generatethe propositions for the development and test setsfor this evaluation. We took a two stage approachto tagging where The NULL arguments are first fil-tered out, and the remaining NON-NULL argumentsare classified into one of the argument types. Theargument identification module used an ensembleof ten classifiers — each trained on a tenth of thetraining data and combined using unweighted vot-ing. This should still give a close to state-of-the-art performance given that the argument identifi-cation performance tends to start to be asymptoticaround 10K training instances (Pradhan et al., 2005).The Chinese propositional structure was predictedwith the Chinese semantic role labeler described in(Xue, 2008), retrained on all the training portion ofthe OntoNotes v5.0 data. No propositional struc-tures were provided for Arabic due to resource con-straints. Table 9 shows the detailed performancenumbers. The CoNLL-2005 scorer was used to com-pute the scores. At first glance, the performance onthe English newswire genre is much lower than whathas been reported for WSJ Section 23. This couldbe attributed to several factors: i) the fact that wehad to compromise on the training method, ii) thenewswire in OntoNotes not only contains WSJ data,but also Xinhua news, iii) The WSJ training and testportions in OntoNotes are a subset of the standardones that have been used to report performance ear-lier; iv) the PropBank guidelines were significantlyrevised during the OntoNotes project in order to syn-

21http://cemantix.org/assert.html

Framesets Lemmas

1 2,7222 321

> 2 181

Table 10: Frameset polysemy across lemmas.

chronize well with the Treebank, and finally v) it in-cludes propositions for be verbs missing from theoriginal PropBank. It looks like the newly addedPivot Text data (comprising of the New Testament)shows very good performance. This is not surprisinggiven a similar trend in it parsing performance.

In addition to automatically predicting the argu-ments, we also trained a classifier to tag PropBankframeset IDs for the English data. Table 7 liststhe number of framesets available across the threelanguages22 An overwhelming number of them aremonosemous, but the more frequent verbs tend to bepolysemous. Table 10 gives the distribution of num-ber of framesets per lemma in the PropBank layer ofthe English OntoNotes v5.0 data.

During automatic processing of the data, wetagged all the tokens that were tagged with a partof speech VBx. This means that there would be caseswhere the wrong token would be tagged with propo-sitions.Named Entities BBN’s IdentiFinderTMsystemwas used to predict the named entities. For theCoNLL-2011 shared task we did not get a chanceto re-train Identifinder, and used the stock modelwhich did not have the same set of named entitiesas in the OntoNotes corpus, so we decided to

22The number of lemmas for English in Table 10 do not addup to this number because not all of them have examples inthe training data, where the total number of instantiated sensesamounts to 4229.

Frameset Total Total % Perfect Argument ID + ClassAccuracy Sentences Propositions Propositions P R F

English Broadcast Conversation [BC] 92 2,037 5,021 52.18 82.55 64.84 72.63Broadcast News [BN] 91 1,252 3,310 53.66 81.64 64.46 72.04Magazine [MZ] 89 780 2,373 47.16 79.98 61.66 69.64Newswire [NW] 93 1,898 4,758 39.72 80.53 62.68 70.49Telephone Conversation [TC] 90 1,366 1,725 45.28 79.60 63.41 70.59Weblogs and Newsgroups [WB] 92 929 2,174 39.19 81.01 60.65 69.37Pivot Corpus [PT] 92 1,217 2,853 50.54 86.40 72.61 78.91

Overall 91 9,479 24,668 44.69 81.47 61.56 70.13

Chinese Broadcast Conversation [BC] - 885 2,323 31.34 53.92 68.60 60.38Broadcast News [BN] - 929 4,419 35.44 64.34 66.05 65.18Magazine [MZ] - 451 2,620 31.68 65.04 65.40 65.22Newswire [NW] - 481 2,210 27.33 69.28 55.74 61.78Telephone Conversation [TC] - 968 1,622 32.74 48.70 59.12 53.41Weblogs and Newsgroups [WB] - 758 1,761 35.21 62.35 68.87 65.45

Overall - 4,472 14,955 32.62 61.26 64.48 62.83

Table 9: Performance on the propositions and framesets in the CoNLL-2012 test set.

16

Page 17: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

All Genre BC BN MZ NW TC WBF F F F F F F

English Cardinal 68.76 58.52 75.34 72.57 83.62 32.26 57.14Date 78.60 73.46 80.61 71.60 84.12 63.89 65.48Event 44.63 30.77 50.00 36.36 50.00 0.00 66.67Facility 47.29 64.20 43.14 40.00 54.17 0.00 28.57GPE 89.77 89.40 93.83 92.87 92.56 81.19 91.36Language 47.06 - 75.00 50.00 33.33 22.22 66.67Law 48.00 0.00 100.00 0.00 50.98 0.00 100.00Location 59.00 54.55 61.36 54.84 67.10 - 44.44Money 75.45 33.33 63.64 77.78 79.12 92.31 58.18NORP 88.58 94.55 93.92 94.87 90.70 78.05 85.15Ordinal 71.39 74.16 80.49 79.07 74.34 84.21 55.17Organization 76.00 60.90 78.57 69.97 84.76 48.98 51.08Percent 89.11 100.00 83.33 75.00 91.41 83.33 72.73Person 78.75 93.35 94.36 87.47 85.80 73.39 76.49Product 52.76 0.00 77.65 0.00 42.55 0.00 0.00Quantity 50.00 17.14 66.67 62.86 81.82 0.00 30.77Time 60.65 66.13 67.33 66.67 64.29 27.03 55.56Work of Art 34.03 42.42 35.62 28.57 54.24 0.00 8.70

Overall 77.95 77.02 84.95 80.33 84.73 62.17 69.47

Table 11: Named Entity performance on the English subset of the CoNLL-2012 test set.

update the model for this round by retraining iton the English portion of the OntoNotes v5.0corpus. Given the various constraints, we could notre-train it on the Chinese and Arabic data, Table11 shows the overall performance of the taggeron the CoNLL-2012 English test set, as well asthe performance broken down by individual nametypes.

Other Layers As noted earlier, systems were al-lowed to make use of gender and number predic-tions for NPs using the table from Bergsma and Lin(Bergsma and Lin, 2006), and the speaker meta-data for broadcast conversations, telephone conver-sations and author or poster metadata for weblogsand newsgroups.

4.4.3 Data FormatIn order to organize the multiple, rich layers of

annotation, the OntoNotes project has created adatabase representation for the raw annotation layersalong with a Python API to manipulate them (Prad-han et al., 2007a). In the OntoNotes distribution thedata is organized as one file per layer, per document.The API requires a certain hierarchical structurewith various annotation layers represented by fileextensions for the documents at the leaves, andlanguage, genre, source and section within a partic-ular source forming the intermediate directories —data/<language>/annotations/<genre>/<source>/

<section>/<document>.<layer>. It comes withvarious ways of querying and manipulating the dataand allows convenient access to the informationinside the sense inventory and PropBank framefiles instead of having to interpret the raw .xml.However, maintaining format consistency withearlier CoNLL tasks was deemed convenient for

sites that already had tools configured to dealwith that format. Therefore, in order to distributethe data so that one could make the best of bothworlds, we created a new file extension — .conll

which logically served as another layer in additionto the .parse, .prop, .sense, .name and .coref

layers which house the respective annotations.Each .conll file contained a merged representationof all the OntoNotes layers in the CoNLL-styletabular format with one line per token, and withmultiple columns for each token specifying theinput annotation layers relevant to that token, withthe final column specifying the target coreferencelayer. Because we are not authorized to distributethe underlying text, and many of the layers containinline annotation, we had to provide a skeletal form(.skel) of the .conll file which is essentially the.conll file, but with the column that contains thewords, anonymized. We provided an assemblyscript that participants could use to create a .conll

file taking as input the .skel file and the top-leveldirectory of the OntoNotes distribution that theyhad separately downloaded from the LDC23. Oncethe .conll file is created, it can be used to createthe individual layers such as .parse, .name, and.coref that have inline annotation, with the providedscripts. We provide the layers that have standoffannotation (mostly with respect to the tokens in thetreebank) like the .prop and .sense along with the.skel file.

In the CoNLL-2011 task, there were a few issues,where some teams used the test data accidentallyduring training. To prevent it from happening again

23OntoNotes is deeply grateful to the Linguistic Data Con-sortium for making the source data freely available to the taskparticipants.

17

Page 18: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

Column Type Description

1 Document ID This is a variation on the document filename2 Part number Some files are divided into multiple parts numbered as 000, 001, 002, ... etc.3 Word number This is the word index in the sentence4 Word The word itself5 Part of Speech Part of Speech of the word6 Parse bit This is the bracketed structure broken before the first open parenthesis in the parse, and the

word/part-of-speech leaf replaced with a *. The full parse can be created by substitutingthe asterisk with the ([pos] [word]) string (or leaf) and concatenating the items inthe rows of that column.

7 Lemma The predicate/sense lemma is mentioned for the rows for which we have semantic role orword sense information. All other rows are marked with a -

8 Predicate Frameset ID This is the PropBank frameset ID of the predicate in Column 7.9 Word sense This is the word sense of the word in Column 4.10 Speaker/Author This is the speaker or author name where available. Mostly in Broadcast Conversation and

Weblog data.11 Named Entities These columns identifies the spans representing various named entities.12:N Predicate Arguments There is one column each of predicate argument structure information for the predicate

mentioned in Column 7.N Coreference Coreference chain information encoded in a parenthesis structure.

Table 12: Format of the .conll file used in the shared task.

#begin document (nw/wsj/07/wsj_0771); part 000......nw/wsj/07/wsj_0771 0 0 ‘‘ ‘‘ (TOP(S(S* - - - - * * (ARG1* * * -nw/wsj/07/wsj_0771 0 1 Vandenberg NNP (NP* - - - - (PERSON) (ARG1* * * * (8|(0)nw/wsj/07/wsj_0771 0 2 and CC * - - - - * * * * * -nw/wsj/07/wsj_0771 0 3 Rayburn NNP *) - - - - (PERSON) *) * * *(23)|8)nw/wsj/07/wsj_0771 0 4 are VBP (VP* be 01 1 - * (V*) * * * -nw/wsj/07/wsj_0771 0 5 heroes NNS (NP(NP*) - - - - * (ARG2* * * * -nw/wsj/07/wsj_0771 0 6 of IN (PP* - - - - * * * * * -nw/wsj/07/wsj_0771 0 7 mine NN (NP*)))) - - 5 - * *) * * * (15)nw/wsj/07/wsj_0771 0 8 , , * - - - - * * * * * -nw/wsj/07/wsj_0771 0 9 ’’ ’’ *) - - - - * * *) * * -nw/wsj/07/wsj_0771 0 10 Mr. NNP (NP* - - - - * * (ARG0* (ARG0* * (15nw/wsj/07/wsj_0771 0 11 Boren NNP *) - - - - (PERSON) * *) *) * 15)nw/wsj/07/wsj_0771 0 12 says VBZ (VP* say 01 1 - * * (V*) * * -nw/wsj/07/wsj_0771 0 13 , , * - - - - * * * * * -nw/wsj/07/wsj_0771 0 14 referring VBG (S(VP* refer 01 2 - * * (ARGM-ADV* (V*) * -nw/wsj/07/wsj_0771 0 15 as RB (ADVP* - - - - * * * (ARGM-DIS* * -nw/wsj/07/wsj_0771 0 16 well RB *) - - - - * * * *) * -nw/wsj/07/wsj_0771 0 17 to IN (PP* - - - - * * * (ARG1* * -nw/wsj/07/wsj_0771 0 18 Sam NNP (NP(NP* - - - - (PERSON* * * * * (23nw/wsj/07/wsj_0771 0 19 Rayburn NNP *) - - - - *) * * * * -nw/wsj/07/wsj_0771 0 20 , , * - - - - * * * * * -nw/wsj/07/wsj_0771 0 21 the DT (NP(NP* - - - - * * * * (ARG0* -nw/wsj/07/wsj_0771 0 22 Democratic JJ * - - - - (NORP) * * * * -nw/wsj/07/wsj_0771 0 23 House NNP * - - - - (ORG) * * * * -nw/wsj/07/wsj_0771 0 24 speaker NN *) - - - - * * * * *) -nw/wsj/07/wsj_0771 0 25 who WP (SBAR(WHNP*) - - - - * * * * (R-ARG0*) -nw/wsj/07/wsj_0771 0 26 cooperated VBD (S(VP* cooperate 01 1 - * * * * (V*) -nw/wsj/07/wsj_0771 0 27 with IN (PP* - - - - * * * * (ARG1* -nw/wsj/07/wsj_0771 0 28 President NNP (NP* - - - - * * * * * -nw/wsj/07/wsj_0771 0 29 Eisenhower NNP *))))))))))) - - - - (PERSON) * *) *) *) 23)nw/wsj/07/wsj_0771 0 30 . . *)) - - - - * * * * * -

nw/wsj/07/wsj_0771 0 0 ‘‘ ‘‘ (TOP(S* - - - - * * * -nw/wsj/07/wsj_0771 0 1 They PRP (NP*) - - - - * (ARG0*) * (8)nw/wsj/07/wsj_0771 0 2 allowed VBD (VP* allow 01 1 - * (V*) * -nw/wsj/07/wsj_0771 0 3 this DT (S(NP* - - - - * (ARG1* (ARG1* (6nw/wsj/07/wsj_0771 0 4 country NN *) - - 3 - * * *) 6)nw/wsj/07/wsj_0771 0 5 to TO (VP* - - - - * * * -nw/wsj/07/wsj_0771 0 6 be VB (VP* be 01 1 - * * (V*) (16)nw/wsj/07/wsj_0771 0 7 credible JJ (ADJP*))))) - - - - * *) (ARG2*) -nw/wsj/07/wsj_0771 0 8 . . *)) - - - - * * * -#end document

Figure 3: Sample portion of the .conll file.

18

Page 19: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

this year, we were advised by the steering commit-tee to distribute the data in two installments. One fortraining and development and the other for testing.The test data released from LDC did not contain thecoreference layer. Therefore, this year unlike previ-ous CoNLL tasks, the test data contained some trulyunseen documents. This made it easier to spot po-tential training errors such as ones that occurred inthe CoNLL-2011 task. Table 12 describes the dataprovided in each of the column of the .conll format.Figure 3 shows a sample from a .conll file.

4.5 EvaluationThis section describes the evaluation criteria used

for the shared task. Unlike propositions, word senseand named entities, where it is simply a matter ofcounting the correct answers, or for parsing, wherethere is an established metric, evaluating the accu-racy of coreference continues to be contentious. Var-ious alternative metrics have been proposed, as men-tioned below, which weight different features of aproposed coreference pattern differently. The choiceis not clear in part because the value of a particularset of coreference predictions is integrally tied to theconsuming application. A further issue in defining acoreference metric concerns the granularity of thementions, and how closely the predicted mentionsare required to match those in the gold standard fora coreference prediction to be counted as correct.Our evaluation criterion was in part driven by theOntoNotes data structures. OntoNotes coreferencemakes the distinction between identity coreferenceand appositive coreference, treating the latter sepa-rately. Thus we evaluated systems only on the iden-tity coreference task, which links all categories ofentities and events together into equivalent classes.The situation with mentions for OntoNotes is alsodifferent than it was for MUC or ACE. OntoNotesdata does not explicitly identify the minimum ex-tents of an entity mention, but it does include hand-tagged syntactic parses. Thus for the official evalua-tion, we decided to use the exact spans of mentionsfor determining correctness. The NP boundariesfor the test data were pre-extracted from the hand-tagged Treebank for annotation, and events trig-gered by verb phrases were tagged using the verbsthemselves. This choice means that scores for theCoNLL-2012 coreference task are likely to be lowerthan for coreference evaluations based on MUC, orACE data, where an approximate match is often al-lowed based on the specified head of the mentions.

4.5.1 MetricsAs noted above, the choice of an evaluation met-

ric for coreference has been a tricky issue and theredoes not appear to be any silver bullet that addressesall the concerns. Three metrics have been commonlyused for evaluating coreference performance over an

unrestricted set of entity types: i) The link basedMUC metric (Vilain et al., 1995), ii) The mentionbased B-CUBED metric (Bagga and Baldwin, 1998)and iii) The entity based CEAF (Constrained En-tity Aligned F-measure) metric (Luo, 2005). Veryrecently BLANC (BiLateral Assessment of Noun-Phrase Coreference) measure (Recasens and Hovy,2011) has been proposed as well. Each metric triesto address the shortcomings or biases of the earliermetrics. Given a set of key entities K, and a set ofresponse entitiesR, with each entity comprising oneor more mentions, each metric generates its variationof a precision and recall measure. The MUC measureis the oldest and most widely used. It focuses onthe links (or, pairs of mentions) in the data.24 Thenumber of common links between entities in K andR divided by the number of links in K representsthe recall, whereas, precision is the number of com-mon links between entities in K and R divided bythe number of links in R. This metric prefers sys-tems that have more mentions per entity — a sys-tem that creates a single entity of all the mentionswill get a 100% recall without significant degrada-tion in its precision. And, it ignores recall for single-ton entities, or entities with only one mention. TheB-CUBED metric tries to addresses MUC’s shortcom-ings, by focusing on the mentions and computes re-call and precision scores for each mention. If K isthe key entity containing mention M, and R is the re-sponse entity containing mention M, then recall forthe mention M is computed as |K∩R|

|K| and precision

for the same is is computed as |K∩R||R| . Overall recall

and precision are the average of the individual men-tion scores. CEAF aligns every response entity withat most one key entity by finding the best one-to-onemapping between the entities using an entity simi-larity metric. This is a maximum bipartite matchingproblem and can be solved by the Kuhn-Munkresalgorithm. This is thus a entity based measure. De-pending on the similarity, there are two variations— entity based CEAF — CEAFe and a mention basedCEAF — CEAFm. Recall is the total similarity di-vided by the number of mentions in K, and preci-sion is the total similarity divided by the number ofmentions in R. Finally, BLANC uses a variation onthe Rand index (Rand, 1971) suitable for evaluatingcoreference. There are a few other measures — onebeing the ACE value, but since this is specific to arestricted set of entities (ACE types), we did not con-sider it.

4.5.2 Official Evaluation MetricIn order to determine the best performing system

in the shared task, we needed to associate a single

24The MUC corpora did not tag single mention entities.

19

Page 20: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

number with each system. This could have beenone of the metrics above, or some combination ofmore than one of them. The choice was not sim-ple, and after having consulted various researchersin the field, we came to a conclusion that each met-ric had its pros and cons and there is no silver bul-let. Therefore we settled on the MELA metric pro-posed by Denis and Baldridge (2009), which takes aweighted average of three metrics: MUC, B-CUBED,and CEAF. The rationale for the combination is thateach of the three metrics represents a different, im-portant dimension. The MUC measure is based onlinks. The B-CUBED is based on mentions, and theCEAF is based on entities. We decided to use the en-tity based CEAFe instead of mention based CEAFm.For a given end application, a weighted average ofthe three might be optimal, but since we don’t havea particular end task in mind, we decided to use theunweighted mean of the three metrics as the scoreon which the winning system was judged. This stillleaves us with a score for each language. We wantedto encourage researchers to run their systems on allthree languages. Therefore, we decided to computethe final official score that would determine the win-ning submission as the average of the MELA metricacross all the three languages. We decided to give aMELA score of zero to every language that a partic-ular group did not run its system on.

4.5.3 Scoring Metrics ImplementationWe used the same core scorer implementation25

that was used for the SEMEVAL-2010 task, andwhich implemented all the different metrics. Therewere a couple of modifications done to this scorersince then.

1. Only exact matches were considered cor-rect. Previously, for SEMEVAL-2010 non-exact matches were judged partially correctwith a 0.5 score if the heads were the sameand the mention extent did not exceed the goldmention.

2. The modifications suggested by Cai and Strube(2010) have been incorporated in the scorer.

Since there are differences in the version used forCoNLL and the one available on the download site,and it is possible that the latter would be revised inthe future, we have archived the version of the scoreron the CoNLL-2012 task webpage.26

5 ParticipantsA total of 41 different groups demonstrated in-

terest in the shared task by registering on the task

25http://www.lsi.upc.edu/∼esapena/downloads/index.php?id=326http://conll.bbn.com/download/scorer.v4.tar.gz

webpage. Of these, 16 groups from 6 countries sub-mitted system outputs on the test set during the eval-uation week. 15 groups participated in at least onelanguage in the closed task, and only one group par-ticipated solely in the open track. One participant(yang) did not submit a final task paper. Tables 13and 14 list the distribution of the participants bycountry and the participation by language and tasktype.

Country Participants

Brazil 1China 8Germany 3Italy 1Switzerland 1USA 2

Table 13: Participation by country.

Closed Open Combined

English 15 1 16Chinese 13 3 14Arabic 7 1 8

Table 14: Participation across languages and tracks.

6 ApproachesTables 15 and 16 summarize the approaches taken

by the participating systems along some importantdimensions. While referring to the participating sys-tems, as a convention, we will use the last name ofthe contact person from the participating team. It isalmost always the last name of the first author of thesystem papers, or the first name in case of conflictinglast names (xinxin). The only exception is chunyangwhich is the first name of the second author for thatsystem. For space and readability purposes, whilereferring to the systems in the paper we will referto the system by the primary contact name in italicsinstead of using explicit citations.

Most of the systems divided the problem into thetypical two phases — first identifying the potentialmentions in the text, and then linking the mentionsto form coreference chains, or entities. Many sys-tems used rule-based approaches for mention detec-tion, though one, yang did use trained models, andli used a hybrid approach by adding mentions froma trained model to the ones identified using rules.All systems ran a post processing stage, after linkingpotential mentions together, to delete the remainingunlinked mentions. It was common for the systemsto represent the markables (mentions) internally in

27The participant did not submit a final paper, so this infor-mation is based on an email correspondence.

20

Page 21: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

Part

icip

ant

Trac

kL

angu

ages

Synt

axL

earn

ing

Fram

ewor

kM

arka

ble

Iden

tifica

tion

Verb

Feat

ure

Sele

ctio

n#

Feat

ures

Trai

n

fern

ande

sC

A,C

,EP

Lat

entS

truc

ture

Perc

eptr

onA

llno

unph

rase

s,pr

onou

nsan

dna

me

entit

ies

×L

aten

tfea

ture

indu

ctio

nan

dfe

atur

ete

mpl

ates

196

tem

plat

es(E

);19

7(C

)and

223

(A)

T+

D

bjor

kelu

ndC

A,C

,EP

LIB

LIN

EA

Rfo

rlin

king

,and

Max

imum

Ent

ropy

(Mal

let)

for

anap

hori

city

NP,

PR

Pan

dP

RP$

inal

llan

guag

es;P

Nan

dN

Rin

Chi

nese

;all

NE

inE

nglis

h.C

lass

ifier

toex

clud

eno

n-re

fere

ntia

lpro

noun

sin

Eng

lish

(with

apr

obab

ility

of0.

95).

×G

reed

yfo

rwar

dse

lect

ion

(sem

i-au

tom

atic

)28

feat

ure

tem

plat

es(C

)and

34(E

)aT

+D

chen

C,O

A,C

,EP

Hyb

rid

—Si

eve

appr

oach

follo

wed

byla

ngua

ge-s

peci

fiche

uris

ticpr

unin

gan

dla

ngua

ge-i

ndep

ende

ntle

arni

ngba

sed

prun

ing;

Gen

resp

ecifi

cm

odel

s

NP,

PR

Pan

dP

RP$

and

sele

cted

NE

inE

nglis

h.N

Pan

dQ

Pin

Chi

nese

.Exc

lude

Chi

nese

inte

rrog

ativ

epr

onou

nsw

hata

ndw

here

.NP

and

sele

cted

NE

inA

rabi

c.L

earn

ing

topr

une

non-

refe

rent

ialm

entio

ns

×B

ackw

ard

elim

inat

ion

–T

stam

borg

CA

,C,E

DL

ogis

ticR

egre

ssio

n(L

IBL

INE

AR

)

NP,

PR

Pan

dP

RP$

inal

llan

guag

es;P

Nin

Chi

nese

;all

NE

inE

nglis

h.E

xclu

depl

eona

stic

itin

Eng

lish.

Prun

esm

alle

rmen

tions

with

sam

ehe

ad.

×Fo

rwar

d+

Bac

kwar

dst

artin

gfr

omC

oNL

L-2

011

feat

ure

set

forE

nglis

han

dSo

onfe

atur

ese

tfor

Chi

nese

and

Ara

bic

18–3

7T

+D

mar

tsch

atC

A,C

D

Dir

ecte

dm

ultig

raph

repr

esen

tatio

nw

here

the

wei

ghts

are

lear

ned

over

the

trai

ning

data

(on

top

ofB

AR

T(V

ersl

eyet

al.,

2008

))

Eig

htdi

ffer

entm

entio

nty

pes

forE

nglis

h,an

dad

ject

ival

use

forn

atio

nsan

da

few

NE

sar

efil

tere

das

wel

las

embe

dded

men

tions

and

pleo

nast

icpr

onou

ns.F

ourm

entio

nty

pes

inC

hine

se.C

opul

asar

eal

soha

ndle

dap

prop

riat

ely.

××

Inth

efo

rmof

nega

tive

and

posi

tive

rela

tions

20%

ofT

(E);

15%

ofT

(C)

chan

gC

E,C

PL

aten

tStr

uctu

reL

earn

ing

All

noun

phra

ses,

pron

ouns

and

nam

een

titie

×C

hang

,eta

l.,20

11T

+D

uryu

pina

CA

,C,E

Pm

odifi

catio

nof

BA

RT

usin

gm

ulti-

obje

ctiv

eop

timiz

atio

n.D

omai

nsp

ecifi

ccl

assi

fiers

for

nwan

dbc

genr

e.

Stan

dard

rule

sfo

rEng

lish

and

Cla

ssifi

erto

iden

tify

mar

kabl

eN

Psin

Chi

nese

and

Ara

bic.

××

∼45

T+

D

zhek

ova

CA

,C,E

PM

emor

yba

sed

lear

ning

(TIM

BL

)N

P,P

RP

and

PR

P$in

Eng

lish,

and

allN

Pin

Chi

nese

and

Ara

bic.

Sing

leto

ncl

assi

fier.

××

33T

+D

liC

A,C

,EP

Max

Ent

All

phra

sety

pes

that

are

men

tions

intr

aini

ngar

eco

nsid

ered

asm

entio

nsan

da

clas

sifie

ris

trai

ned

toid

entif

ypo

tent

ialm

entio

ns.

√×

11fe

atur

egr

oups

T+

D

yuan

C,O

E,C

PC

4.5

and

dete

rmin

istic

rule

sA

llno

unph

rase

s,pr

onou

nsan

dna

me

entit

ies

××

–T

+D

xuC

E,C

PD

ecis

ion

tree

clas

sifie

rand

dete

rmin

istic

rule

s

All

noun

phra

ses,

pron

ouns

and

sele

cted

nam

eden

titie

sse

lect

edan

dov

erla

ppin

gm

entio

nsar

eco

nsid

ered

whe

nth

eyar

ese

cond

-lev

elN

Psin

side

anN

P,fo

rexa

mpl

eco

ordi

natin

gN

Ps×

×51

(E)a

nd61

(C)

T

chun

yang

CE

,CP

Rul

e-ba

sed

(ada

ptat

ion

ofL

eeet

al.2

011’

ssi

eve

stru

ctur

e)C

hine

seN

Pan

dpr

onou

nsus

ing

part

ofsp

eech

PN

and

nam

esus

ing

part

ofsp

eech

NR

excl

udin

gm

easu

rew

ords

and

cert

ain

nam

es×

×–

yang

27C

EP

Stru

ctur

alSV

MM

entio

nde

tect

ion

clas

sifie

Sam

efe

atur

ese

t,bu

tper

clas

sifie

r40

T(2

011)

xinx

inC

E,C

PM

axE

ntN

P,P

RP

and

PR

P$in

Eng

lish

and

Chi

nese

×G

reed

yfo

rwar

dba

ckw

ard

71T

+D

shou

CE

PM

odifi

edve

rsio

nof

Lee

etal

.,20

11si

eve

syst

emxi

ong

OA

,C,E

PL

eeet

al.,

2011

syst

em

Tabl

e15

:Pa

rtic

ipat

ing

syst

empr

ofile

s—

Part

I.In

the

Task

colu

mn,

C/O

repr

esen

tsw

heth

erth

esy

stem

part

icip

ated

inth

ecl

osed

,ope

nor

both

trac

ks.

Inth

eSy

ntax

colu

mn,

aP

repr

esen

tsth

atth

esy

stem

sus

eda

phra

sest

ruct

ure

gram

mar

repr

esen

tatio

nof

synt

ax,w

here

asa

Dre

pres

ents

that

they

used

ade

pend

ency

repr

esen

tatio

n.In

the

Trai

nco

lum

nT

repr

esen

tstr

aini

ngda

taan

dD

repr

esen

tsde

velo

pmen

tdat

a.a C

omm

unic

atio

nw

ithA

nder

sB

jork

elun

d.

21

Page 22: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

Part

icip

ant

Posi

tive

Trai

ning

Exa

mpl

esN

egat

ive

Trai

ning

Exa

mpl

esD

ecod

ing

fern

ande

sId

entif

ylik

ely

men

tions

with

anai

mto

gene

rate

high

reca

llus

ing

the

siev

em

etho

dpr

opos

edin

(Lee

etal

.,20

11).

Cre

ate

dire

cted

arcs

betw

een

men

tion

pair

sus

ing

ase

tofr

ules

Aco

nstr

aine

dla

tent

pred

icto

rfind

sth

em

axim

umsc

orin

gdo

cum

entt

ree

amon

gpo

ssib

leca

ndid

ates

bjor

kelu

ndC

lose

stA

ntec

eden

t(So

on,2

001)

Neg

ativ

eex

ampl

esin

betw

een

anap

hora

ndcl

oses

tant

eced

ent(

Soon

,200

1)

Stac

ked

reso

lver

s—

i)B

estfi

rst,

ii)Pr

onou

ncl

oses

tfirs

t—cl

oses

tfirs

tfor

pron

ouns

and

best

first

foro

ther

men

tions

and

iii)c

lust

er-m

entio

n;di

sallo

wtr

ansi

tive

nest

ing;

prop

erno

unm

entio

nspr

oces

sed

first

,fol

low

edby

othe

rno

uns

and

pron

ouns

chen

Rul

e-ba

sed

siev

eap

proa

chfo

llow

edby

heur

istic

and

lear

ning

base

dpr

unin

g

stam

borg

Clo

sest

Ant

eced

ent(

Soon

,200

1)N

egat

ive

exam

ples

inbe

twee

nan

apho

rand

clos

esta

ntec

eden

t(So

on,2

001)

Chi

nese

and

Ara

bic

—C

lose

st-fi

rstc

lust

erin

gfo

rpro

noun

san

dB

est-

first

clus

teri

ngot

herw

ise.

Eng

lish

—cl

oses

t-fir

stfo

rpro

noun

san

dav

erag

edbe

st-fi

rstc

lust

erin

got

herw

ise.

mar

tsch

atW

eigh

tsar

etr

aine

don

part

ofth

etr

aini

ngda

taG

reed

ycl

uste

ring

forE

nglis

h;Sp

ectr

alcl

uste

ring

follo

wed

bygr

eedy

clus

teri

ngfo

rChi

nese

tore

duce

num

bero

fcan

dida

tean

tece

dent

s.

chan

gC

lose

stA

ntec

eden

t(So

on,2

001)

All

prec

edin

gm

entio

nsin

aun

ion

ofof

gold

and

pred

icte

dm

entio

ns.M

entio

nsw

here

the

first

ispr

onou

nan

dot

hern

otar

eno

tco

nsid

ered

Bes

tlin

kst

rate

gy;s

epar

ate

clas

sifie

rsfo

rpro

nom

inal

and

non-

pron

omin

alm

entio

nsfo

rEng

lish.

Sing

lecl

assi

fierf

orC

hine

se.

uryu

pina

Clo

sest

Ant

eced

ent(

Soon

,200

1)N

egat

ive

exam

ples

inbe

twee

nan

apho

rand

clos

esta

ntec

eden

t(So

on,2

001)

men

tion

pair

mod

elw

ithou

tran

king

asin

Soon

2001

zhek

ova

Clo

sest

Ant

eced

ent(

Soon

,200

1)N

egat

ive

exam

ples

inbe

twee

nan

apho

rand

clos

esta

ntec

eden

t(So

on,2

001)

All

defin

iteph

rase

sus

edto

crea

tea

pair

fore

ach

anap

horw

ithea

chm

entio

npr

eced

ing

itw

ithin

aw

indo

wof

10(E

nglis

h,C

hine

se)o

r7(A

rabi

c)se

nten

ces.

li—

Bes

t-fir

stcl

uste

ring

yuan

—D

eter

min

istic

NP-N

Pfo

llow

edby

PP-N

P

xuW

indo

wof

sent

ence

sis

used

tode

term

ine

posi

tive

and

nega

tive

exam

ples

.For

Eng

lish

aw

indo

wof

5se

nten

ces

isus

edw

here

asfo

rChi

nese

aw

indo

wof

10se

nten

ces

isus

edA

ll-pa

irlin

king

follo

wed

bypr

unin

gor

corr

ectio

nus

ing

ase

tofr

ules

for

NE

-NE

and

NP-N

Pm

entio

nsfo

rsen

tenc

esou

tsid

eof

a5/

10se

nten

cew

indo

win

Eng

lish

and

Chi

nese

resp

ectiv

ely

chun

yang

Lee

etal

.,20

11sy

stem

yang

Pre-

clus

terp

airm

odel

sse

para

tefo

reac

hpa

irN

P-N

P,N

P-P

RP

and

PR

P-P

RP

Pre-

clus

ters

,with

sing

leto

npr

onou

npr

e-cl

uste

rs,a

ndus

ecl

oses

t-fir

stcl

uste

ring

.Diff

eren

tlin

km

odel

sba

sed

onth

ety

peof

linki

ngm

entio

ns—

NP-P

RP,

PR

P-P

RP

and

NP-N

P

xinx

inC

lose

stA

ntec

eden

t(So

on,2

001)

Neg

ativ

eex

ampl

esin

betw

een

anap

hora

ndcl

oses

tant

eced

ent(

Soon

,200

1)B

est-

first

clus

teri

ngm

etho

dsh

ouM

odifi

edve

rsio

nof

Lee

etal

.,20

11co

refe

renc

esy

stem

xion

gL

eeet

al.,

2011

syst

em

Tabl

e16

:Par

ticip

atin

gsy

stem

profi

les

—Pa

rtII

.Thi

sfo

cuse

son

the

way

posi

tive

and

nega

tive

exam

ples

wer

ege

nera

ted

and

the

deco

ding

stra

tegy

used

.

22

Page 23: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

terms of individual parse tree NP constituent spans.Some systems consider only mention-specific at-tributes while performing the clustering, but the re-cent trend seems to indicate a shared attribute model,where the attributes of an entity are determined col-lectively by heuristically merging the attribute typesand values of its constituent mentions. For example,if a mention marked singular is clustered with an-other entity marked plural, then the collective num-ber for the entity is assigned to be {singular, plu-ral}. Various types of trained models were usedfor predicting coreference. For a learning-basedsystem, generation of positive and negative exam-ples is very important. The participating systemsused a range of sentence windows surrounding theanaphor in generating these examples. In the sys-tems that used trained models, many systems usedthe approach described in Soon et al. (2001) for se-lecting the positive and negative training examples,while others used some of the alternative approachesthat have been introduced in the literature more re-cently. Following on the success of rule-based link-ing model in the CoNLL-2011 shared task, manysystems used a completely rule-based linking model,or used it as a initializing, or intermediate step in alearning based system. A hybrid approach seemsto be a central theme of many high scoring sys-tems. Also, taking cue from last year’s systems, al-most all systems trained pleonastic it classifiers, andused speaker-based constraints/features for the con-versation genre. Many systems used the predictedArabic parts of speech that were mapped-down toPenn-style parts of speech, but stamborg used someheuristics to convert them back to the complex partof speech type, using more frequent mapping, toget better performance for Arabic. The fernandessystem uses feature templates defined on mentionpairs. bjorkelund mentions that disallowing transi-tive closures gave performance improvement of 0.6and 0.4 respectively for English and Chinese/Arabic.bjorkelund also mentions seeing a considerable in-crease in performance after adding features that cor-respond to the Shortest Edit Script (Myers, 1986)between surface forms and unvocalised Buckwal-ter forms, respectively. These could be better atcapturing the differences in gender and number sig-naled by certain morphemes than hand-crafted rules.chen built upon the sieve architecture proposed inRaghunathan et al. (2010) and added one more sieve— head match — for Chinese and modified twosieves. Some participants tried to incorporate pecu-liarities of the corpus in their systems. For example,martschat excluded adjectival nation names. UnlikeEnglish, and especially in absence of an external re-source, it is hard to make a gender distinction inArabic and Chinese. martschat used the information

that 先先先生(sir) and 女士(lady) often suggest genderinformation. bo and martschat used plurality mark-ers 们 to identify plurals. For example, 同学 (stu-dent) is singular and同学们 (students) is plural. boalso uses a heuristic that if the word和 (and) appearsin the middle of a mention M, and the two parts sep-arated by和 are sub-mentions of M, then mention Mis considered to be plural. Other words which havethe similar meaning of 和, such as 同, 与 and 跟,are also considered. uryupina used the rich part ofspeech tags to classify pronouns into subtype, per-son number and gender. Chinese and Arabic do nothave definite noun phrase markers like the in En-glish. In contrast to English there is no strict en-forcement of using definite noun phrases when re-ferring to an antecedent in Chinese. Both 这次演说 (the talk) and 演说 (talk) can corefer with theantecedent 克林顿在河内大选的演说 (Clinton’stalk during Hanoi election). This makes it very diffi-cult to distinguish generic expressions from referen-tial ones. martschat checks whether the phrase startswith a definite/demonstrative indicator (e.g.,这(this)or 那(that)) in order to identify demonstrative anddefinite noun phrases. For Arabic, uryupina consid-ers as definite all mentions with definite head nouns(prefixed with “Al”) and all the idafa constructs witha definite modifier. chang uses training data to iden-tify inappropriate mention boundaries. They per-form a relaxed matching between predicted men-tions and gold mentions ignoring punctuation marksand mentions that start with one of the following:adverb, verb, determiner, and cardinal number. Inanother extreme, xiong translated Chinese and Ara-bic to English, and ran an English system and pro-jected mentions back to the source languages. Un-fortunately, it did not work quite well by itself. Oneissue that they faced was that many instances of pro-nouns did not have a corresponding mention in thesource language (since we do not consider mentionsformed by dropped subjects/objects). Nevertheless,using this in addition to performing coreference res-olution in these languages could be useful. Similarto last year, most participants appear not to have fo-cused much on eventive coreference, those corefer-ence chains that build off verbs in the data. This usu-ally means that nominal mentions that should havelinked to the eventive verb were instead linked inwith some other entity, or remained unlinked. Par-ticipants may have chosen not to focus on events be-cause they pose unique challenges while making uponly a small portion of the data (Roughly 90% ofmentions in the data are NPs and pronouns). Many ofthe trained systems were also able to improve theirperformance by using feature selection, the detailsvaried depending on the example selection strategyand the classifier used.

23

Page 24: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

7 ResultsIn this section we will take a look at the perfor-

mance overview of various systems and then lookat the performance for each language in various set-tings separately. For the official test, beyond the rawsource text, coreference systems were provided onlywith the predictions for the other annotation layers(parses, semantic roles, word senses, and named en-tities). A high-level summary of the results for thesystems on the primary evaluation for both open andclosed tracks is shown in Table 17. The scores un-der the columns for each language are the averageof MUC, BCUBED and CEAFe for that language. Thecolumn Official Score is the average of those per-language averages, but only for the closed track. If aparticipant did not participate in all three languages,then they got a score of zero for the languages thatwere not attempted. The systems are sorted in de-scending order of this final Official Score. The lasttwo columns indicate whether the systems used onlythe training or both training and development forthe final submissions. Most top performing systemsused both training and development data for trainingthe final system. Note that all the results reportedhere still used the same, predicted information forall input layers.

It can be seen that fernandes got the highest com-bined score (58.69) across all three languages andmetrics. While scores for each individual languageare lower than the figures cited for other corpora, itis as expected, given that the task here includes pre-dicting the underlying mentions and mention bound-aries, the insistence on exact match, and given thatthe relatively easier appositive coreference cases arenot included in this measure. The combined scoreacross all languages is purely for ranking purposes,and does not really tell much about each individuallanguage. Owing to the ordering based on officialscore, not all the best performing systems for a par-ticular language are in sequential order. Therefore,for easier reading, the scores of the top ranking sys-tem are in bold red, and the top four systems areunderlined in the table.

Looking at the the English performance, we cansee that fernandes gets the best average across thethree selected metrics (MUC, BCUBED and CEAFe).The next best system is martschat (61.31) followedvery closely by bjorkelund (61.24) and then chang(60.18). The performance differences between thebetter-scoring systems were not large, with onlyabout three points separating the top four systems,and only six out of a total of sixteen systems gettinga score lower than 58 points which was the highestperforming score in CoNLL-2011.28

In case of Chinese, it is seen that chen performs28More precise comparison later in Section 8.

the best with a score of 62.24. This is then followedby yuan (60.69), and then bjorkelund (59.97) andxu (59.22). It is interesting to note that the scoresfor the top performing systems for both English andChinese are very close. For all we know, this is justa coincidence. Also, for both English and Chinese,the top performing system is almost 2 points higherthan the second best system.

On the Arabic language front, once again, fernan-des has the highest score of 54.22, followed closelyby bjorkelund (53.55) and then uryupina (50.41)

Since the majority of mentions in all the threelanguages are noun phrases or pronouns, the accu-racy with which these are predicted in the parse treesshould directly bear on the coreference scores. Sincepronouns are a closed class and single words, themain focus falls on the accuracy of the noun phrases.By no means is the accuracy of noun phrases theonly factor determining the overall coreference ac-curacy, but it cannot be ignored either. It can be ob-served that the coreference scores for the three lan-guages are in the same trend as the noun phrase ac-curacies for those languages as seen in Table 6. Re-call that in case of both Chinese and Arabic, thereare roughly 11% instances of dropped pronouns thatwere not considered as part of the evaluation. Theperformance for Chinee and Arabic would decreasesomewhat if these were considered in the set of goldmentions (and entities).

Tables 18 and 19 show similar information for thetwo supplementary tasks — one given gold mentionboundaries (GB) and one given correct, gold men-tions (GM). We have however, kept the same rela-tive ordering of the system participants as in Table17 for ease of reading. Looking at Table 18 care-fully, we can see that for English and Arabic the rel-ative ranking of the systems remain almost the same,except for a few outliers: chang performs the bestgiven gold mentions — by almost 7 points over thenext best performing system. In the case of Chinese,chen performs almost 6 points better than the officialperformance given gold boundaries, and another 9points given gold mentions and almost 8 points bet-ter than the next best system using gold mentions.We will look at more details in the following sec-tions.

As mentioned earlier in Section 4.2 we conductedsome supplementary evaluations. These can be cat-egorized by a combination of two parameters. Oneof which applies to both training and test set, andone can only apply to the test set. The two parame-ters are: i) Syntax and ii) Mention Quality. Syntaxcan take two values: i) predicted (PS), or ii) gold(GS), and can be applicable during either training ortest; and, the mention quality can be of three values:i) No boundaries (NB), ii) Gold mention boundaries

24

Page 25: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

Participant Open Closed Official Final model

English Chinese Arabic English Chinese Arabic Score Train Dev

fernandes 63.37 58.49 54.22 58.69√ √

bjorkelund 61.24 59.97 53.55 58.25√ √

chen 63.53 59.69 62.24 47.13 56.35√ ×

stamborg 59.36 56.85 49.43 55.21√ √

uryupina 56.12 53.87 50.41 53.47√ √

zhekova 48.70 44.53 40.57 44.60√ √

li 45.85 46.27 33.53 41.88√ √

yuan 61.02 58.68 60.69 39.79√ √

xu 57.49 59.22 38.90√ ×

martschat 61.31 53.15 38.15√ ×

chunyang 59.24 51.83 37.02 – –yang 55.29 18.43

√ ×chang 60.18 45.71 35.30

√ ×xinxin 48.77 51.76 33.51

√ √shou 58.25 19.42

√ ×xiong 59.23 44.35 44.37 0.00

√ √

Table 17: Performance on primary open and closed tracks using all predicted information.

Participant Open Closed Suppl. Final model

English Chinese Arabic English Chinese Arabic Score Train Dev

fernandes 63.16 61.48 53.90 59.51√ √

bjorkelund 60.75 62.76 53.50 59.00√ √

chen 70.00 60.33 68.55 47.27 58.72√ ×

stamborg 57.35 54.30 49.59 53.75√ √

zhekova 49.30 44.93 40.24 44.82√ √

li 43.04 43.28 31.46 39.26√ √

yuan 59.50 64.42 41.31√ √

xu 56.47 64.08 40.18√ ×

chang 60.89 20.30√ √

Table 18: Performance on supplementary open and closed tracks using all predicted information, given gold mentionboundaries.

Participant Open Closed Suppl. Final model

English Chinese Arabic English Chinese Arabic Score Train Dev

fernandes 69.35 66.36 63.49 66.40√ √

bjorkelund 68.20 69.92 59.14 65.75√ √

chen 78.98 70.46 77.77 52.26 66.83√ ×

stamborg 68.66 66.97 53.35 62.99√ √

zhekova 59.06 51.44 55.72 55.41√ √

li 51.40 59.93 40.62 50.65√ √

yuan 69.88 76.05 48.64√ √

xu 63.46 69.79 44.42√ ×

chang 77.22 25.74√ √

Table 19: Performance on supplementary open and closed tracks using all predicted information, given gold mentions.

25

Page 26: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

Testing Mention Quality No Boundaries Gold Boundaries Gold Mention

20

30

40

50

60

70

80

Arabic Chinese English

fernandes bjorklund

chen stamborg

20

30

40

50

60

70

80

xu yuan chang

li

Figure 4: Performance for eight participating systems for the three languages, across the three mention qualities.

(GB) and iii) Gold mentions (GM), and can only beapplicable during testing (since this information isnot optional during training, as is the case with us-ing gold or predicted syntax). There are a total oftwelve combinations that we can form of using theseparameters. Out of these, we thought six were par-ticularly interesting. This is the product of the threecases of mention quality — NB, GB and GM, and twocases of syntax – GS and PS used during testing.

Figure 4 shows a performance plot for eight par-ticipating systems that attempted both the supple-mentary tasks — GB and GM in addition to the mainNB for at least one of the three languages. Theseare all in the closed setting. At the bottom of theplot you can see dots that indicate what test condi-tion to which a particular point refers. In most cases,for the hardest task — NB — the English and Chi-nese performances track quite close to each other.When provided with gold mention boundaries (GB),systems, chen, xu and yuan do significantly betterin Chinese. There is almost no positive effect onthe English performance across the board. In fact,performance of the stamborg and li systems dropsnoticeably. There is also a drop in performance forthe bjorkelund system, but the difference is probablynot significant. Finally, when provided with goldmentions, the performance of all systems increasesacross all languages, with chang showing the high-est gain for English, and chen showing the highest

gain for Chinese.

Figure 5 is a box and whiskers plot of the per-formance for all the systems for each language andvariations — NB, GB, and GM. The circle in the cen-ter indicates the mean of the performances. The hor-izontal line in between the box indicates the median,and the bottom and top of the boxes indicate the firstand third quartiles respectively, with the whiskers in-dicating the highest and lowest performance on thattask. It can be easily seen that the English systemshave the least divergence, with the divergence largefor the GM case probably owing to chang. This issomewhat expected as this is the second year for theEnglish task, and so it does show a more mature andstable performance. On the other hand, both Chineseand Arabic plots show much more divergence, withthe Chinese and Arabic GB case showing the highestdivergence. Also, except for Chinese GM condition,there is some skewness in the score distribution oneway or the other.

Some participants ran their systems on six ofthe twelve possible combinations for all three lan-guages. Figure 6 shows a plot for these tree par-ticipants — fernandes, bjorkelund, and chen. As inFigure 4, the dots at the bottom help identify whichparticular combination of parameters the point onthe plot represents. In addition to the three testconditions related to mention quality, we now alsohave two more test conditions relating to the syntax.

26

Page 27: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

We can see that the fernandes and bjorkelund, sys-tem performance tracks very close to each other. Inother words, using gold standard parses during test-ing does not show much benefit in those cases. Incase of chen, however, using gold parses shows asignificant jump in scores for the NB condition. Itseems that somehow, chen makes much better useof the gold parses. In fact, the performance is veryclose to the one with the GB condition. It is not clearwhat this system is doing differently that makes thispossible. Adding more information, i.e., the GMcondition, improves the performance by almost thesame delta as going from NB to GB.

Finally, Figure 7 shows the plot for one system —bjorkelund — that was ran on ten of the twelve dif-ferent settings. As usual the dots at the bottom helpidentify the conditions for a point on the plot. Now,there is a condition related to the quality of syntaxduring training as well. For some reasons, usinggold syntax hurts performance — though slightly —in the NB and GB settings. Chinese does show some

improvement when gold parse is used for training,only when gold mentions are available during test-ing.

One point to note is that we cannot comparethese results to the ones obtained in the SEMEVAL-2010 coreference task which used a small portion ofOntoNotes data because it was only using nominalentities, and had heuristically added singleton men-tions29.

29The documentation that comes with the SEMEVAL datapackage from LDC (LDC2011T01) states: “Only nominalmentions and identical (IDENT) types were taken from theOntoNotes coreference annotation, thus excluding coreferencerelations with verbs and appositives. Since OntoNotes is onlyannotated with multi-mention entities, singleton referential ele-ments were identified heuristically: all NPs and possessive de-terminers were annotated as singletons excluding those func-tioning as appositives or as pre-modifiers but for NPs in the pos-sessive case. In coordinated NPs, single constituents as wellas the entire NPs were considered to be mentions. There is noreliable heuristic to automatically detect English expletive pro-nouns, thus they were (although inaccurately) also annotated assingletons.”

0

20

40

60

80

100

Ara

bic

(NB

)

Ara

bic

(GB

)

Ara

bic

(GM

)

Chi

nese

(N

B)

Chi

nese

(G

B)

Chi

nese

(G

M)

Eng

lish

(NB

)

Eng

lish

(GB

)

Eng

lish

(GM

)

Figure 5: A box and whiskers plot of the performance for the three languages across the three mention qualities.

27

Page 28: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

0

10

20

30

40

50

60

70

80

90

100

Arabic Chinese English

Testing Syntax Gold Predicted

Mention Quality No Boundaries Gold Boundaries Gold Mention

fernandes chenbjorklund

Figure 6: Performance of fernandes, bjorkelund and chen over six different settings.

In the following sections we will look at the re-sults for the three languages, in various settings inmore detail. It might help to describe the format ofthe tables first. Given that our choice of the officialmetric was somewhat arbitrary, it is also useful tolook at the individual metrics. The tables are simi-lar in structure to Table 20. Each table provides re-sults across multiple dimensions. For completeness,the tables include the raw precision and recall scoresfrom which the F-scores were derived. Each tableshows the scores for a particular system for the taskof mention detection and coreference resolution sep-arately. The tables also include two additional scores(BLANC and CEAFm) that did not factor into the of-ficial score. Useful further analysis may be possiblebased on these results beyond the preliminary resultspresented here. As you recall, OntoNotes does notcontain any singleton mentions. Owing to this pecu-liar nature of the data, the mention detection scorescannot be interpreted independently of the corefer-ence resolution scores. In this scenario, a mentionis effectively an anaphoric mention that has at leastone other mention coreferent with it in the docu-ment. Most systems removed singletons from theresponse as a post-processing step, so not only will

they not get credit for the singleton entities that theyincorrectly removed from the data, but they will bepenalized for the ones that they accidentally linkedwith another mention. What this number does in-dicate is the ceiling on recall that a system wouldhave got in absence of being penalized for makingmistakes in coreference resolution. The tables aresub-divided into several logical horizontal sectionsseparated by two horizontal lines. There can be atotal of 12 sections, each categorized by a combi-nation of two parse quality features GS and PS foreach training and test set and three variations on themention qualities — NB, GB, and GM, as describedearlier. Just like we used the dots below the graphsearlier to indicate the parameters that were chosenfor a particular point on the plot, we use small blacksquares in the tables after the participant name, toindicate the conditions chosen for the results on thatparticular row. Since there are many rows to eachtable, in order to facilitate finding which number weare referring to, we have added a ID column whichuses letters e, c, and a to refer to the three languages— English, Chinese and Arabic. This is followed bya decimal number, in which the number before thedecimal identifies the logical block within the table

28

Page 29: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

TrainingSyntaxGoldPredicted

TestingSyntaxGoldPredicted

Mention QualityNo BoundariesGold BoundariesGold Mention

Arabic Chinese English

40

50

60

70

80

Figure 7: Performance of bjorkelund over ten different settings.

that share the same experiment parameters, and theone after the decimal indicates the index of a par-ticular system in that block. Systems are sorted bythe official score within each block. All the sys-tems with NB setting are listed first, followed by GB,followed by GM. One participant (bjorkelund) ranmore variations than we had originally planned, butsince it falls under the general permutation and com-bination of the settings that we were considering, itmakes sense to list those results here as well.

7.1 English ClosedTable 20 shows the performance for the English

language in greater detail.

Official Setting Recall is quite important in themention detection stage because the full coreferencesystem has no way to recover if the mention de-tection stage misses a potentially anaphoric men-tion. The linking stage indirectly impacts the finalmention detection accuracy. After a complete passthrough the system some correct mentions could re-main unlinked with any other mentions and wouldbe deleted thereby lowering recall. Most systemstend to get a close balance between recall and preci-sion for the mention detection task. A few systemshad a considerable gap between the final mentiondetection recall and precision (fernandes, xu, yang,li and xinxin). It is not clear why this might be thecase. One commonality between the ones that hada much higher precision than recall was that they

used machine learned classifiers for mention detec-tion. This could be possible because any classifierthat is trained will not normally contain singletonmentions (as none have been annotated in the data)unless one explicitly adds them to the set of train-ing examples (which is not mentioned in any of therespective system papers). A hybrid rule-based andmachine learned model (fernandes) performed thebest. Apart from some local differences, the rank-ing for all the systems is roughly the same irrespec-tive of which metric is chosen. The CEAFe mea-sure seems to penalize systems more harshly thanthe other measures. If the CEAFe measure does in-dicate the accuracy of entities in the response, thissuggests that fernandes is doing better on getting co-herent entities than any other system.

Gold Mention Boundaries In this case, all possi-ble mention boundaries are provided to the system.This is very similar to what annotators see whenthey annotate the documents. One difficulty withthis supplementary evaluation is that these bound-aries alone provide only very partial information.For the roughly 10 to 20% of mentions that the auto-matic parser did not correctly identify, while the sys-tems knew the correct boundaries, they had no struc-tural syntactic or semantic information, and theyalso had to further approximate the already heuris-tic head word identification. This incomplete datacomplicates the systems’ task and also complicatesinterpretation of the results. While most systems did

29

Page 30: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

slightly better here in terms of raw scores, the per-formance was not much different from the officialtask, indicating that mention boundary errors result-ing from problems in parsing do not contribute sig-nificantly to the final output.30

Gold Mentions Another supplementary conditionthat we explored was if the systems were suppliedwith the manually-annotated spans for all and onlythose mentions that did participate in the gold stan-dard coreference chains. This supplies significantlymore information than the previous case, where ex-act spans were supplied for all NPs, since the goldmentions will also include verb headwords that arelinked to event NPs, and will not include singletonmentions, which do not end up as part of any chain.The latter constraint makes this test seem artificial,since it directly reveals part of what the systems aredesigned to determine, but it still has some value inquantifying the impact that mention detection andanaphoricity determination has on the overall taskand what the results are if they are perfectly known.The results show that performance does go up signif-icantly, indicating that it is markedly easier for thesystems to generate better entities given gold men-tions. Although, ideally, one would expect a perfectmention detection score, it is the case that many ofthe systems did not get a 100% recall. This couldpossibly be owing to unlinked singletons that wereremoved in post-processing. chang along with fer-nandes are the only systems that got a perfect 100%recall. The reason is most likely because they hada hard constraint to link all mentions with at leastone other mention. chang (77.22 [e7.00]) stands outin that it has a 7 point lead on the next best sys-tem in this category. This indicates that the link-ing algorithm for this system is significantly superiorthan the other systems — especially since the perfor-mance of the only other system that gets 100% men-tion score (fernandes) is much lower (69.35 [e7.03])

Gold Test Parses Looking at Table 20 it can beseen that there is a slight increase (∼1 point) in per-formance across all the systems when gold parsesacross all settings — NB, GB, and GM. In the caseof bjorkelund for the NB setting, the overall perfor-mance improves by a percent when using gold testparse during testing (61.24 [e0.02] vs 62.23 [e1.02]),but strangely if gold parses are used during train-ing as well, the performance is slightly lower (61.71[e3.00]), although this difference is probably not sta-tistically significant.

30It would be interesting to measure the overlap between theentity clusters for these two cases, to see whether there wasany substantial difference in the mention chains, besides the ex-pected differences in boundaries for individual mentions.

7.2 Chinese ClosedTable 21 shows the performance for the Chinese

language in greater detail.

7.2.1 Official SettingIn this case, it turns out that chen does about 2

points better than the next best system across all themetrics. We know that this system had some moreChinese-specific improvements. It is strange thatfernandes has a much lower mention recall with amuch higher precision as compared to chen. As faras the system descriptions go, both systems seem tohave used the same set of mentions — except forchen including QP phrases and not considering inter-rogative pronouns. One thing we found about chenwas that they dealt with nested NPs differently incase of the NW genre to achieve some performanceimprovement. This unfortunately seems to be ad-dressing a quirk in the Chinese newswire data owingto a possible data inconsistency in the release.

7.2.2 Gold Mention BoundariesUnlike English, just the addition of gold mention

boundaries improves the performance of almost allsystems significantly. The delta improvement forfernandes turns out to be small, but it does gain onthe mention recall as compared to the NB case. Itis not clear why this might be the case. One ex-planation could be that the parser performance forconstituents that represent mentions — primarily NPmight be significantly worse than that for English.The mention recall of all the systems is boosted byroughly 10%.

7.2.3 Gold MentionsProviding gold mention information further sig-

nificantly boosts all systems. More so is the casewith chen [e8.00] which gains another 9 points overthe gold mention boundary condition in spite of thefact that they don’t have a perfect recall. On theother hand, fernandes gets a perfect mention recalland precision, but ends up getting a 11 point lowerperformance [c8.05] than chen. Another thing tonote is that for the CEAFe metric, the incrementaldrop in performance from the best to the next bestand so on, is substantial, with a difference of 17points between chen and fernandes. It does seemthat the chen and yuan algorithm for linking is muchbetter than the others.

7.2.4 Gold Test ParsesWhen provided with gold parses for the test set,

there is a substantial increase in performance for theNB condition – numerically more so than in case ofEnglish. The degree of improvement decreases forthe GB and GM conditions.

30

Page 31: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

IDPa

rtic

ipan

tTr

ain

Test

ME

NT

ION

DE

TE

CT

ION

CO

RE

FE

RE

NC

ER

ESO

LU

TIO

NO

ffici

alSy

ntax

Synt

axM

entio

nQ

lty.

MU

CB

CU

BE

DC

EA

F mC

EA

F eB

LA

NC

AG

AG

NB

GB

GM

RP

FR

PF 1

RP

F 2R

PF

RP

F 3R

PF

F1+

F2+

F3

3

e0.0

0fe

rnan

des

��

�72

.75

83.4

577

.73

65.8

375

.91

70.5

165

.79

77.6

971

.24

61.9

461

.94

61.9

455

.00

43.1

748

.37

77.3

578

.07

77.7

063

.37

e0.0

1m

arts

chat

��

�74

.23

76.1

075

.15

65.2

168

.83

66.9

766

.50

74.6

970

.36

59.6

159

.61

59.6

148

.64

44.7

246

.60

73.2

978

.94

75.7

361

.31

e0.0

2bj

orke

lund

��

�73

.75

77.0

975

.38

65.2

370

.10

67.5

865

.90

75.2

470

.26

59.2

059

.20

59.2

048

.60

43.4

245

.87

73.4

778

.81

75.8

061

.24

e0.0

3ch

ang

��

�72

.48

76.2

574

.32

64.7

768

.06

66.3

867

.04

71.8

169

.34

58.2

858

.28

58.2

846

.64

43.1

144

.81

75.2

474

.73

74.9

860

.18

e0.0

4ch

en�

��

75.0

872

.60

73.8

263

.47

63.9

663

.71

66.5

771

.52

68.9

658

.13

58.1

358

.13

46.6

846

.15

46.4

171

.30

79.1

574

.48

59.6

9e0

.05

stam

borg

��

�75

.51

72.3

973

.92

66.2

663

.98

65.1

069

.09

69.5

469

.31

56.7

656

.76

56.7

642

.53

44.8

943

.68

74.0

377

.28

75.5

259

.36

e0.0

6ch

unya

ng�

��

75.2

372

.24

73.7

164

.08

63.5

763

.82

66.4

570

.71

68.5

157

.24

57.2

457

.24

45.1

345

.67

45.4

071

.12

77.9

273

.95

59.2

4e0

.07

yuan

��

�73

.19

71.8

872

.53

62.0

863

.02

62.5

566

.23

70.4

568

.27

57.2

857

.28

57.2

845

.74

44.7

445

.23

72.0

576

.86

74.1

658

.68

e0.0

8sh

ou�

��

75.3

572

.08

73.6

863

.46

62.3

962

.92

65.3

168

.90

67.0

555

.68

55.6

855

.68

44.2

045

.35

44.7

769

.43

75.0

871

.81

58.2

5e0

.09

xu�

��

62.6

484

.55

71.9

659

.11

75.1

866

.18

62.2

873

.29

67.3

453

.19

53.1

953

.19

48.2

932

.64

38.9

575

.16

68.3

770

.95

57.4

9e0

.10

uryu

pina

��

�71

.82

69.9

670

.88

61.0

060

.78

60.8

963

.59

68.4

865

.95

52.4

452

.44

52.4

441

.42

41.6

441

.53

67.4

072

.83

69.6

556

.12

e0.1

1ya

ng�

��

64.2

873

.99

68.7

955

.29

65.2

059

.84

60.3

872

.74

65.9

951

.90

51.9

051

.90

45.2

035

.95

40.0

568

.69

71.6

670

.03

55.2

9e0

.12

xinx

in�

��

73.9

354

.55

62.7

855

.48

42.7

248

.27

66.9

356

.67

61.3

744

.83

44.8

344

.83

31.6

843

.55

36.6

864

.77

66.1

465

.42

48.7

7e0

.13

zhek

ova

��

�65

.78

68.4

967

.11

54.2

852

.79

53.5

262

.26

54.9

058

.35

43.5

443

.54

43.5

433

.52

34.9

634

.22

67.2

358

.77

60.6

348

.70

e0.1

4li

��

�45

.78

86.7

259

.93

39.1

272

.57

50.8

443

.03

80.0

655

.98

41.9

741

.97

41.9

749

.44

22.3

030

.74

64.0

166

.86

65.2

445

.85

e1.0

0fe

rnan

des

��

�74

.76

84.8

179

.47

67.7

377

.25

72.1

866

.42

78.0

171

.75

63.0

563

.05

63.0

556

.16

44.5

149

.66

76.8

978

.60

77.7

164

.53

e1.0

1m

arts

chat

��

�76

.87

77.3

377

.10

67.9

070

.37

69.1

167

.83

75.3

471

.39

60.9

260

.92

60.9

249

.38

46.6

147

.96

74.0

779

.77

76.5

462

.82

e1.0

2bj

orke

lund

��

�75

.53

77.8

676

.68

67.0

071

.17

69.0

266

.56

75.7

170

.84

59.9

059

.90

59.9

049

.22

44.6

846

.84

73.4

280

.19

76.2

762

.23

e1.0

3ch

en�

��

77.1

375

.15

76.1

366

.30

66.9

966

.65

67.7

372

.77

70.1

659

.70

59.7

059

.70

47.9

947

.21

47.6

072

.63

80.0

875

.70

61.4

7e1

.04

zhek

ova

��

�66

.05

69.6

267

.79

54.4

553

.59

54.0

261

.66

55.6

258

.48

43.7

143

.71

43.7

133

.82

34.6

534

.23

66.8

459

.27

61.2

048

.91

e2.0

0bj

orke

lund

��

�76

.23

72.6

474

.39

66.5

065

.22

65.8

568

.45

71.0

169

.71

58.3

558

.35

58.3

544

.85

46.2

145

.52

74.3

677

.29

75.7

260

.36

e3.0

0bj

orke

lund

��

�78

.68

73.8

376

.18

68.9

666

.73

67.8

369

.70

71.4

770

.58

59.5

559

.55

59.5

545

.55

47.9

746

.73

74.9

877

.92

76.3

561

.71

e4.0

0fe

rnan

des

��

�71

.91

85.2

978

.03

64.9

277

.53

70.6

764

.25

78.9

570

.85

61.5

961

.59

61.5

956

.48

41.6

947

.97

76.2

877

.87

77.0

563

.16

e4.0

1ch

ang

��

�72

.01

79.8

375

.72

64.5

871

.36

67.8

066

.07

73.8

769

.75

58.5

158

.51

58.5

149

.14

41.7

145

.12

75.9

273

.95

74.8

860

.89

e4.0

2bj

orke

lund

��

�71

.95

78.9

775

.30

63.4

471

.63

67.2

963

.95

76.5

969

.70

58.4

858

.48

58.4

850

.00

41.3

545

.27

73.0

578

.99

75.5

960

.75

e4.0

3ch

en�

��

74.7

875

.70

75.2

463

.26

66.7

864

.97

65.3

873

.56

69.2

358

.57

58.5

758

.57

48.8

144

.92

46.7

971

.85

80.2

275

.20

60.3

3e4

.04

yuan

��

�75

.73

70.8

473

.20

64.5

062

.79

63.6

468

.53

69.9

969

.25

57.9

057

.90

57.9

044

.73

46.5

345

.61

72.8

676

.93

74.6

959

.50

e4.0

5st

ambo

rg�

��

76.4

667

.30

71.5

966

.16

58.8

062

.26

71.2

164

.98

67.9

555

.20

55.2

055

.20

38.4

645

.83

41.8

376

.19

73.5

874

.80

57.3

5e4

.06

xu�

��

66.3

175

.82

70.7

462

.15

67.1

364

.55

67.5

366

.97

67.2

552

.02

52.0

252

.02

40.0

435

.45

37.6

077

.29

67.4

270

.74

56.4

7e4

.07

zhek

ova

��

�66

.45

70.9

168

.61

54.9

654

.67

54.8

261

.85

55.6

058

.56

43.9

643

.96

43.9

634

.38

34.6

734

.53

68.4

959

.51

61.5

249

.30

e4.0

8li

��

�60

.00

44.4

751

.08

44.1

733

.66

38.2

166

.37

53.9

359

.51

39.3

039

.30

39.3

027

.53

36.5

131

.39

63.2

660

.00

61.3

343

.04

e5.0

0fe

rnan

des

��

�72

.69

86.1

178

.83

65.6

578

.26

71.4

064

.36

79.0

970

.97

62.0

062

.00

62.0

057

.36

42.2

348

.65

75.8

978

.28

77.0

263

.67

e5.0

1bj

orke

lund

��

�73

.24

79.2

576

.13

64.7

572

.29

68.3

164

.62

77.0

170

.27

59.2

059

.20

59.2

050

.45

42.3

646

.05

73.2

280

.28

76.1

761

.54

e5.0

2ch

en�

��

75.5

976

.58

76.0

864

.67

68.0

766

.33

66.0

974

.02

69.8

359

.38

59.3

859

.38

49.3

645

.55

47.3

872

.09

80.4

175

.43

61.1

8e5

.03

stam

borg

��

�77

.17

67.1

471

.81

66.8

858

.82

62.5

971

.55

64.9

768

.10

55.3

055

.30

55.3

038

.28

46.3

441

.93

76.1

673

.62

74.8

157

.54

e5.0

4zh

ekov

a�

��

65.8

271

.72

68.6

554

.68

55.5

155

.09

61.2

256

.59

58.8

243

.90

43.9

043

.90

34.8

534

.04

34.4

468

.10

59.7

661

.79

49.4

5e6

.00

bjor

kelu

nd�

��

76.2

675

.47

75.8

666

.55

68.0

067

.27

67.6

073

.05

70.2

259

.04

59.0

459

.04

46.9

945

.42

46.1

974

.60

78.3

876

.32

61.2

3e7

.00

chan

g�

��

100.

0010

0.00

100.

0083

.16

88.4

885

.74

75.3

679

.69

77.4

673

.76

73.7

673

.76

75.3

862

.71

68.4

681

.23

78.9

980

.05

77.2

2e7

.01

chen

��

�80

.82

100.

0089

.39

72.2

989

.40

79.9

464

.60

85.9

273

.75

68.3

568

.35

68.3

576

.25

46.4

057

.69

75.4

184

.80

79.1

270

.46

e7.0

2yu

an�

��

80.0

310

0.00

88.9

172

.22

89.1

679

.80

64.7

584

.68

73.3

967

.76

67.7

667

.76

74.4

945

.46

56.4

675

.60

83.1

078

.69

69.8

8e7

.03

fern

ande

s�

��

100.

0010

0.00

100.

0070

.69

91.2

179

.65

65.4

685

.61

74.1

967

.64

67.6

467

.64

74.7

142

.55

54.2

279

.41

80.1

779

.78

69.3

5e7

.04

stam

borg

��

�78

.17

100.

0087

.74

71.2

288

.12

78.7

764

.75

83.1

672

.81

66.7

466

.74

66.7

471

.94

43.7

454

.41

78.6

881

.47

79.9

968

.66

e7.0

5bj

orke

lund

��

�75

.69

100.

0086

.16

69.5

190

.71

78.7

062

.37

87.0

372

.67

66.3

266

.32

66.3

274

.14

41.5

253

.23

76.4

982

.90

79.2

268

.20

e7.0

6xu

��

�67

.09

100.

0080

.30

64.8

189

.94

75.3

462

.88

80.5

770

.63

59.1

359

.13

59.1

365

.28

33.6

744

.42

78.6

072

.37

74.8

763

.46

e7.0

7zh

ekov

a�

��

78.5

510

0.00

87.9

868

.38

78.1

172

.92

63.0

458

.60

60.7

450

.35

50.3

550

.35

52.6

437

.10

43.5

369

.87

61.1

962

.74

59.0

6e7

.08

li�

��

57.1

899

.99

72.7

548

.04

81.5

160

.45

44.1

381

.21

57.1

847

.82

47.8

247

.82

61.8

225

.98

36.5

865

.26

69.9

367

.12

51.4

0e8

.00

chen

��

�81

.46

100.

0089

.78

73.2

289

.69

80.6

265

.32

85.8

774

.20

68.8

768

.87

68.8

776

.43

47.2

658

.40

75.6

684

.84

79.3

171

.07

e8.0

1fe

rnan

des

��

�10

0.00

100.

0010

0.00

71.1

891

.24

79.9

765

.81

85.5

174

.38

67.8

367

.83

67.8

374

.93

43.0

954

.72

79.6

580

.31

79.9

769

.69

e8.0

2st

ambo

rg�

��

78.8

310

0.00

88.1

671

.96

88.4

079

.33

65.3

183

.28

73.2

167

.26

67.2

667

.26

72.2

944

.48

55.0

778

.92

81.5

980

.17

69.2

0e8

.03

bjor

kelu

nd�

��

76.9

410

0.00

86.9

770

.66

91.1

579

.60

62.9

087

.56

73.2

166

.90

66.9

066

.90

74.8

842

.65

54.3

576

.42

84.4

379

.71

69.0

5e8

.04

zhek

ova

��

�78

.73

100.

0088

.10

68.5

478

.10

73.0

163

.14

58.6

360

.80

50.6

150

.61

50.6

152

.84

37.4

443

.83

70.2

861

.55

63.2

359

.21

e9.0

0bj

orke

lund

��

�79

.55

100.

0088

.61

72.5

990

.04

80.3

864

.94

85.7

273

.89

68.1

468

.14

68.1

474

.77

45.2

756

.40

78.2

383

.31

80.4

870

.22

Tabl

e20

:Per

form

ance

ofsy

stem

sin

the

prim

ary

and

supp

lem

enta

ryev

alua

tions

fort

hecl

osed

trac

kfo

rEng

lish.

31

Page 32: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

IDPa

rtic

ipan

tTr

ain

Test

ME

NT

ION

DE

TE

CT

ION

CO

RE

FE

RE

NC

ER

ESO

LU

TIO

NO

ffici

alSy

ntax

Synt

axM

entio

nQ

lty.

MU

CB

CU

BE

DC

EA

F mC

EA

F eB

LA

NC

AG

AG

NB

GB

GM

RP

FR

PF 1

RP

F 2R

PF

RP

F 3R

PF

F1+

F2+

F3

3

c0.0

0ch

en�

��

71.1

272

.16

71.6

459

.92

64.6

962

.21

69.7

377

.81

73.5

562

.18

62.1

862

.18

53.4

348

.73

50.9

772

.79

84.5

377

.34

62.2

4c0

.01

yuan

��

�72

.75

64.0

968

.15

62.3

658

.42

60.3

373

.12

72.6

772

.90

59.5

959

.59

59.5

947

.10

50.7

048

.83

73.7

278

.22

75.7

660

.69

c0.0

2bj

orke

lund

��

�69

.45

63.5

466

.37

58.7

258

.49

58.6

171

.23

75.0

773

.10

59.0

159

.01

59.0

148

.09

48.2

948

.19

72.2

581

.61

76.0

759

.97

c0.0

3xu

��

�64

.33

66.1

065

.20

55.0

261

.47

58.0

768

.39

76.3

872

.16

57.4

657

.46

57.4

650

.40

44.8

147

.44

71.6

474

.57

73.0

059

.22

c0.0

4fe

rnan

des

��

�57

.24

78.2

866

.13

52.6

970

.58

60.3

462

.99

80.5

770

.70

57.7

357

.73

57.7

353

.75

37.8

844

.44

75.0

679

.59

77.1

158

.49

c0.0

5st

ambo

rg�

��

57.6

571

.93

64.0

152

.56

64.1

357

.77

64.4

377

.55

70.3

855

.57

55.5

755

.57

47.9

038

.04

42.4

172

.74

77.8

475

.00

56.8

5c0

.06

uryu

pina

��

�50

.98

70.0

959

.03

45.6

263

.13

52.9

759

.17

80.7

868

.31

52.4

052

.40

52.4

048

.47

34.5

240

.32

68.7

280

.76

73.1

153

.87

c0.0

7m

arts

chat

��

�48

.49

74.0

258

.60

42.7

167

.80

52.4

155

.37

85.2

467

.13

51.3

051

.30

51.3

051

.81

32.4

639

.92

63.9

682

.81

69.1

853

.15

c0.0

8ch

unya

ng�

��

61.1

162

.12

61.6

150

.02

49.6

449

.83

65.8

165

.50

65.6

649

.88

49.8

849

.88

39.8

440

.17

40.0

067

.12

65.8

366

.45

51.8

3c0

.09

xinx

in�

��

55.6

856

.09

55.8

947

.64

48.5

548

.09

66.1

670

.60

68.3

149

.92

49.9

249

.92

39.2

638

.53

38.8

969

.48

73.9

171

.44

51.7

6c0

.10

li�

��

36.6

087

.01

51.5

332

.48

71.4

444

.65

45.5

186

.06

59.5

445

.70

45.7

045

.70

55.1

125

.24

34.6

264

.99

76.6

368

.92

46.2

7c0

.11

chan

g�

��

39.4

359

.97

47.5

830

.85

49.2

237

.93

53.0

278

.31

63.2

344

.89

44.8

944

.89

44.9

229

.99

35.9

759

.82

71.2

463

.16

45.7

1c0

.12

zhek

ova

��

�35

.12

72.5

247

.32

31.1

957

.97

40.5

649

.49

77.6

560

.45

41.8

641

.86

41.8

645

.92

25.2

432

.58

64.2

961

.64

62.7

944

.53

c1.0

0ch

en�

��

83.2

279

.25

81.1

972

.14

72.7

772

.46

75.3

280

.20

77.6

868

.67

68.6

768

.67

58.3

757

.64

58.0

076

.48

87.0

780

.80

69.3

8c1

.01

yuan

��

�83

.69

70.3

376

.43

73.7

365

.50

69.3

877

.97

74.3

776

.13

64.7

764

.77

64.7

749

.99

58.3

853

.86

77.1

279

.84

78.4

166

.46

c1.0

2xu

��

�75

.02

73.0

774

.03

65.9

569

.28

67.5

873

.28

77.6

575

.40

62.5

362

.53

62.5

353

.90

50.7

252

.26

76.5

075

.17

75.8

165

.08

c1.0

3bj

orke

lund

��

�77

.77

68.0

372

.57

66.7

663

.36

65.0

174

.21

75.8

475

.02

62.1

762

.17

62.1

749

.74

52.9

251

.28

74.3

882

.36

77.7

763

.77

c1.0

4fe

rnan

des

��

�63

.83

81.7

371

.68

59.3

574

.49

66.0

766

.31

81.4

373

.10

61.1

961

.19

61.1

955

.97

41.5

047

.66

78.1

181

.28

79.6

062

.28

c1.0

5st

ambo

rg�

��

63.0

775

.21

68.6

158

.32

67.9

562

.76

67.4

778

.54

72.5

858

.98

58.9

858

.98

49.8

041

.15

45.0

676

.29

80.0

978

.05

60.1

3c1

.06

uryu

pina

��

�56

.61

71.3

863

.14

50.7

464

.53

56.8

161

.78

80.1

169

.76

54.5

954

.59

54.5

948

.94

37.4

942

.45

70.3

780

.87

74.4

256

.34

c1.0

7m

arts

chat

��

�52

.94

77.1

762

.80

47.3

072

.19

57.1

557

.31

86.6

969

.00

53.9

253

.92

53.9

254

.11

34.4

042

.06

65.3

984

.86

70.9

556

.07

c1.0

8ch

unya

ng�

��

67.5

265

.33

66.4

156

.11

52.8

154

.41

68.2

764

.49

66.3

351

.92

51.9

251

.92

40.3

943

.47

41.8

869

.16

65.0

366

.80

54.2

1c1

.09

xinx

in�

��

62.5

958

.21

60.3

253

.49

50.4

451

.92

68.7

069

.78

69.2

451

.95

51.9

551

.95

39.6

542

.18

40.8

771

.40

74.6

672

.90

54.0

1c1

.10

yang

��

�58

.66

56.5

257

.57

48.7

449

.49

49.1

165

.61

72.9

169

.07

49.6

149

.61

49.6

140

.11

39.5

239

.81

64.5

173

.91

67.9

552

.66

c1.1

1li

��

�40

.34

89.7

455

.66

34.8

573

.77

47.3

346

.00

86.0

659

.95

46.7

946

.79

46.7

956

.58

25.7

735

.42

66.0

177

.39

69.9

647

.57

c1.1

2ch

ang

��

�43

.28

62.1

651

.03

33.3

650

.29

40.1

153

.96

77.6

063

.65

45.9

445

.94

45.9

445

.06

31.1

036

.80

60.4

071

.57

63.7

946

.85

c1.1

3zh

ekov

a�

��

37.8

474

.84

50.2

733

.95

60.2

943

.44

50.9

577

.28

61.4

143

.34

43.3

443

.34

46.6

826

.13

33.5

065

.98

62.1

563

.73

46.1

2c2

.00

bjor

kelu

nd�

��

74.8

655

.07

63.4

661

.06

48.9

254

.32

76.0

867

.35

71.4

557

.39

57.3

957

.39

42.3

953

.10

47.1

573

.42

77.9

675

.48

57.6

4c3

.00

bjor

kelu

nd�

��

85.2

959

.87

70.3

670

.92

54.0

761

.36

79.5

167

.83

73.2

160

.80

60.8

060

.80

43.6

159

.31

50.2

676

.26

78.9

477

.53

61.6

1c4

.00

chen

��

�81

.99

78.9

780

.45

70.7

672

.12

71.4

374

.37

79.9

177

.04

67.8

767

.87

67.8

757

.95

56.4

157

.17

75.9

586

.75

80.3

268

.55

c4.0

1yu

an�

��

82.8

966

.86

74.0

272

.12

61.5

966

.44

77.9

672

.30

75.0

262

.96

62.9

662

.96

47.1

757

.47

51.8

175

.77

78.4

877

.05

64.4

2c4

.02

xu�

��

73.2

172

.68

72.9

463

.54

68.7

366

.03

71.3

678

.70

74.8

561

.57

61.5

761

.57

53.9

049

.07

51.3

772

.90

77.0

674

.80

64.0

8c4

.03

bjor

kelu

nd�

��

71.8

870

.18

71.0

261

.69

65.5

463

.56

70.4

279

.14

74.5

261

.32

61.3

261

.32

52.0

348

.51

50.2

072

.44

85.4

977

.40

62.7

6c4

.04

fern

ande

s�

��

64.2

179

.18

70.9

158

.76

71.4

664

.49

66.6

279

.88

72.6

560

.40

60.4

060

.40

54.0

942

.02

47.2

977

.69

80.9

679

.22

61.4

8c4

.05

stam

borg

��

�71

.21

55.2

862

.24

61.1

347

.20

53.2

775

.47

63.0

968

.73

52.8

252

.82

52.8

235

.81

47.7

140

.91

74.5

371

.12

72.6

954

.30

c4.0

6zh

ekov

a�

��

36.9

773

.98

49.3

032

.09

58.3

041

.39

49.4

377

.38

60.3

242

.09

42.0

942

.09

46.3

525

.71

33.0

763

.41

61.4

762

.34

44.9

3c4

.07

li�

��

68.2

241

.88

51.9

052

.56

30.6

238

.70

76.1

548

.52

59.2

741

.06

41.0

641

.06

25.1

443

.51

31.8

667

.82

58.7

061

.47

43.2

8c5

.00

chen

��

�83

.22

79.2

581

.19

72.1

472

.77

72.4

675

.32

80.2

077

.68

68.6

768

.67

68.6

758

.37

57.6

458

.00

76.4

887

.07

80.8

069

.38

c5.0

1yu

an�

��

83.7

567

.46

74.7

273

.25

62.4

967

.44

78.4

672

.68

75.4

663

.52

63.5

263

.52

47.5

458

.13

52.3

076

.10

78.7

977

.37

65.0

7c5

.02

bjor

kelu

nd�

��

76.5

370

.28

73.2

765

.46

65.8

765

.66

72.7

978

.38

75.4

962

.91

62.9

162

.91

52.2

151

.82

52.0

173

.70

85.5

278

.36

64.3

9c5

.03

fern

ande

s�

��

63.8

381

.73

71.6

859

.35

74.4

966

.07

66.3

181

.43

73.1

061

.19

61.1

961

.19

55.9

741

.50

47.6

678

.11

81.2

879

.60

62.2

8c5

.04

stam

borg

��

�71

.36

55.3

262

.32

61.1

747

.27

53.3

375

.64

63.4

168

.99

53.2

653

.26

53.2

636

.11

48.0

541

.23

75.7

572

.53

74.0

254

.52

c5.0

5zh

ekov

a�

��

37.8

974

.79

50.3

033

.93

60.1

943

.39

50.8

777

.27

61.3

543

.29

43.2

943

.29

46.6

226

.13

33.4

965

.91

62.3

163

.81

46.0

8c5

.06

li�

��

74.1

040

.23

52.1

456

.15

28.5

037

.81

79.0

043

.67

56.2

539

.32

39.3

239

.32

22.4

945

.72

30.1

568

.12

57.0

859

.83

41.4

0c6

.00

bjor

kelu

nd�

��

84.0

961

.19

70.8

369

.85

55.7

161

.98

78.5

769

.93

74.0

061

.52

61.5

261

.52

45.2

358

.43

50.9

975

.38

81.3

378

.02

62.3

2c7

.00

bjor

kelu

nd�

��

81.0

710

0.00

89.5

572

.72

92.2

881

.34

69.4

791

.38

78.9

372

.24

72.2

472

.24

81.3

252

.45

63.7

777

.30

89.4

082

.09

74.6

8c8

.00

chen

��

�84

.72

100.

0091

.73

76.6

092

.41

83.7

772

.95

91.4

381

.15

75.8

375

.83

75.8

383

.56

57.8

668

.38

79.2

191

.38

84.0

977

.77

c8.0

1yu

an�

��

81.7

299

.79

89.8

574

.77

92.7

482

.79

70.9

191

.21

79.7

973

.67

73.6

773

.67

81.9

854

.65

65.5

877

.48

88.1

481

.81

76.0

5c8

.02

bjor

kelu

nd�

��

71.6

310

0.00

83.4

765

.36

93.2

376

.85

64.4

493

.51

76.3

068

.30

68.3

068

.30

78.5

944

.24

56.6

175

.77

91.5

681

.56

69.9

2c8

.03

xu�

��

71.5

110

0.00

83.3

865

.63

94.0

777

.32

65.0

591

.23

75.9

566

.22

66.2

266

.22

78.1

343

.77

56.1

173

.98

79.7

176

.48

69.7

9c8

.04

stam

borg

��

�68

.97

100.

0081

.63

63.5

288

.23

73.8

663

.54

88.1

273

.84

65.6

065

.60

65.6

072

.56

42.0

153

.21

76.9

683

.70

79.8

966

.97

c8.0

5fe

rnan

des

��

�10

0.00

100.

0010

0.00

61.6

490

.81

73.4

363

.55

89.4

374

.30

65.1

065

.10

65.1

072

.78

39.6

851

.36

80.2

183

.39

81.7

166

.36

c8.0

6li

��

�63

.59

99.9

577

.73

55.2

882

.28

66.1

355

.95

82.9

966

.84

57.5

057

.50

57.5

066

.76

36.0

646

.83

70.5

377

.67

73.4

759

.93

c8.0

7zh

ekov

a�

��

47.5

310

0.00

64.4

342

.02

79.5

755

.00

50.2

280

.81

61.9

446

.88

46.8

846

.88

60.2

727

.08

37.3

768

.60

63.6

265

.58

51.4

4c9

.00

chen

��

�85

.92

100.

0092

.42

77.8

892

.85

84.7

174

.02

91.6

781

.91

76.7

676

.76

76.7

684

.33

59.4

569

.74

79.6

391

.55

84.4

578

.79

c9.0

1yu

an�

��

82.5

899

.80

90.3

875

.69

93.0

683

.48

71.6

291

.39

80.3

074

.23

74.2

374

.23

82.4

055

.60

66.4

077

.77

88.3

082

.06

76.7

3c9

.02

bjor

kelu

nd�

��

75.9

310

0.00

86.3

268

.94

93.7

679

.46

66.8

893

.48

77.9

770

.30

70.3

070

.30

80.5

347

.73

59.9

376

.06

91.1

181

.66

72.4

5c9

.03

stam

borg

��

�69

.08

100.

0081

.71

63.5

288

.24

73.8

763

.56

88.5

674

.00

65.8

965

.89

65.8

972

.93

42.2

253

.48

77.4

385

.65

80.9

167

.12

c9.0

4fe

rnan

des

��

�10

0.00

100.

0010

0.00

61.7

091

.45

73.6

963

.57

89.7

674

.43

65.0

665

.06

65.0

672

.84

39.4

951

.21

80.0

883

.21

81.5

566

.44

c9.0

5li

��

�68

.68

100.

0081

.43

58.8

881

.04

68.2

058

.37

80.2

567

.59

58.7

458

.74

58.7

466

.44

38.8

549

.03

71.3

176

.92

73.7

261

.61

c9.0

6zh

ekov

a�

��

48.8

210

0.00

65.6

144

.12

80.8

957

.10

51.7

980

.53

63.0

447

.84

47.8

447

.84

60.3

727

.69

37.9

670

.49

64.0

666

.45

52.7

0

Tabl

e21

:Per

form

ance

ofsy

stem

sin

the

prim

ary

and

supp

lem

enta

ryev

alua

tions

fort

hecl

osed

trac

kfo

rChi

nese

.

32

Page 33: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

IDPa

rtic

ipan

tTr

ain

Test

ME

NT

ION

DE

TE

CT

ION

CO

RE

FE

RE

NC

ER

ESO

LU

TIO

NO

ffici

alSy

ntax

Synt

axM

entio

nQ

lty.

MU

CB

CU

BE

DC

EA

F mC

EA

F eB

LA

NC

AG

AG

NB

GB

GM

RP

FR

PF 1

RP

F 2R

PF

RP

F 3R

PF

F1+

F2+

F3

3

a0.0

0fe

rnan

des

��

�62

.72

67.0

064

.79

43.6

349

.69

46.4

662

.70

72.1

967

.11

55.5

955

.59

55.5

952

.49

46.0

949

.08

63.9

871

.91

66.9

754

.22

a0.0

1bj

orke

lund

��

�56

.78

64.8

660

.55

43.9

052

.51

47.8

262

.89

75.3

268

.54

53.4

253

.42

53.4

248

.45

40.8

044

.30

66.4

574

.61

69.6

353

.55

a0.0

2ur

yupi

na�

��

56.4

754

.35

55.3

941

.33

41.6

641

.49

65.7

769

.23

67.4

650

.82

50.8

250

.82

42.4

342

.13

42.2

865

.58

70.5

667

.69

50.4

1a0

.03

stam

borg

��

�56

.10

63.2

859

.47

39.1

143

.49

41.1

861

.57

67.9

564

.61

50.1

650

.16

50.1

644

.86

40.3

642

.49

66.8

066

.94

66.8

749

.43

a0.0

4ch

en�

��

56.1

663

.95

59.8

038

.13

39.9

639

.02

60.5

962

.51

61.5

347

.49

47.4

947

.49

41.8

939

.84

40.8

466

.45

61.8

463

.69

47.1

3a0

.05

zhek

ova

��

�27

.54

80.3

441

.02

19.6

462

.13

29.8

541

.91

90.7

257

.33

42.7

442

.74

42.7

456

.79

24.8

134

.53

57.1

079

.19

60.6

540

.57

a0.0

6li

��

�18

.17

80.4

329

.65

10.7

755

.60

18.0

536

.17

93.3

452

.14

37.0

337

.03

37.0

355

.45

20.9

530

.41

52.9

173

.93

54.1

233

.53

a1.0

0fe

rnan

des

��

�65

.03

68.7

166

.82

46.3

851

.78

48.9

363

.53

72.3

767

.66

56.4

956

.49

56.4

952

.57

46.8

849

.56

64.8

472

.97

67.9

455

.38

a1.0

1bj

orke

lund

��

�58

.33

64.6

061

.30

45.1

452

.15

48.3

963

.73

74.4

568

.68

53.5

253

.52

53.5

247

.78

41.5

344

.44

66.8

173

.83

69.6

553

.84

a1.1

2ch

en�

��

56.4

163

.41

59.7

038

.22

39.5

738

.89

60.9

162

.06

61.4

847

.73

47.7

347

.73

41.8

040

.27

41.0

266

.70

61.8

663

.78

47.1

3a1

.13

zhek

ova

��

�28

.00

82.2

141

.78

15.4

745

.92

23.1

539

.22

84.8

653

.65

39.5

239

.52

39.5

255

.10

24.2

233

.65

54.1

361

.78

55.6

336

.82

a2.0

0bj

orke

lund

��

�61

.88

62.5

262

.20

46.1

147

.66

46.8

765

.83

69.7

467

.73

53.7

753

.77

53.7

745

.82

44.3

345

.06

67.6

970

.71

69.0

653

.22

a3.0

0bj

orke

lund

��

�67

.07

62.4

464

.67

51.5

749

.76

50.6

569

.53

69.8

869

.71

56.2

156

.21

56.2

146

.26

47.9

847

.11

71.0

972

.67

71.8

555

.82

a4.0

0fe

rnan

des

��

�65

.34

64.8

265

.08

45.1

847

.39

46.2

664

.56

69.4

466

.91

54.8

854

.88

54.8

849

.73

47.3

948

.53

64.2

870

.09

66.6

453

.90

a4.0

1bj

orke

lund

��

�57

.77

63.7

460

.61

44.7

851

.47

47.9

063

.75

74.2

768

.61

53.1

853

.18

53.1

847

.16

41.2

444

.00

66.9

473

.43

69.6

153

.50

a4.0

2st

ambo

rg�

��

57.4

364

.62

60.8

140

.22

44.1

742

.10

61.4

567

.24

64.2

249

.92

49.9

249

.92

44.6

040

.50

42.4

666

.79

66.0

866

.42

49.5

9a4

.03

chen

��

�57

.21

62.5

559

.76

38.6

639

.24

38.9

561

.52

61.7

761

.65

47.8

447

.84

47.8

441

.55

40.9

041

.22

66.7

861

.94

63.8

747

.27

a4.0

4zh

ekov

a�

��

27.4

875

.53

40.2

918

.75

56.4

728

.16

42.6

789

.25

57.7

442

.57

42.5

742

.57

55.5

325

.36

34.8

256

.61

76.3

559

.86

40.2

4a4

.05

li�

��

52.9

520

.71

29.7

820

.62

7.78

11.3

079

.37

41.2

154

.25

33.6

833

.68

33.6

821

.73

42.8

728

.84

54.0

451

.10

51.4

631

.46

a5.0

0fe

rnan

des

��

�65

.03

68.7

166

.82

46.3

851

.78

48.9

363

.53

72.3

767

.66

56.4

956

.49

56.4

952

.57

46.8

849

.56

64.8

472

.97

67.9

455

.38

a5.0

1bj

orke

lund

��

�58

.29

64.6

361

.30

45.1

452

.20

48.4

163

.71

74.5

068

.68

53.5

253

.52

53.5

247

.80

41.5

144

.44

66.8

173

.84

69.6

553

.84

a5.0

2st

ambo

rg�

��

57.6

864

.18

60.7

640

.53

43.9

842

.18

61.7

066

.75

64.1

349

.55

49.5

549

.55

44.0

140

.47

42.1

665

.23

64.8

965

.06

49.4

9a5

.03

chen

��

�56

.41

63.4

559

.72

38.2

239

.59

38.8

960

.90

62.0

761

.48

47.7

347

.73

47.7

341

.81

40.2

641

.02

66.7

061

.86

63.7

847

.13

a5.0

4zh

ekov

a�

��

28.0

682

.39

41.8

715

.56

46.1

823

.28

39.2

384

.95

53.6

739

.52

39.5

239

.52

55.1

024

.20

33.6

354

.15

61.9

555

.66

36.8

6a6

.00

bjor

kelu

nd�

�67

.04

62.4

764

.67

51.5

749

.80

50.6

769

.52

69.9

269

.72

56.2

156

.21

56.2

146

.27

47.9

547

.10

71.1

072

.70

71.8

655

.83

a7.0

0fe

rnan

des

��

�10

0.00

100.

0010

0.00

57.2

576

.48

65.4

860

.27

79.8

168

.68

62.5

662

.56

62.5

672

.61

46.0

056

.32

69.0

374

.87

71.4

963

.49

a7.0

1bj

orke

lund

��

�61

.85

100.

0076

.43

49.5

778

.62

60.8

155

.55

85.3

567

.29

59.5

059

.50

59.5

070

.28

37.9

949

.32

70.6

980

.85

74.6

159

.14

a7.0

2zh

ekov

a�

��

57.9

510

0.00

73.3

842

.48

80.3

655

.58

50.8

789

.69

64.9

255

.42

55.4

255

.42

71.9

634

.52

46.6

661

.36

82.0

066

.12

55.7

2a7

.03

stam

borg

��

�56

.13

100.

0071

.90

41.9

969

.78

52.4

350

.45

81.3

062

.26

54.0

054

.00

54.0

066

.16

34.5

245

.37

67.3

773

.46

69.8

753

.35

a7.0

4ch

en�

��

58.2

910

0.00

73.6

541

.72

63.2

350

.28

50.0

075

.25

60.0

853

.16

53.1

653

.16

64.6

036

.24

46.4

367

.15

66.6

566

.90

52.2

6a7

.05

li�

��

35.6

710

0.00

52.5

822

.43

64.6

233

.31

38.6

788

.07

53.7

442

.25

42.2

542

.25

60.9

524

.36

34.8

155

.64

68.5

257

.96

40.6

2a8

.00

fern

ande

s�

��

100.

0010

0.00

100.

0056

.89

76.2

765

.17

60.0

780

.02

68.6

262

.62

62.6

262

.62

72.2

445

.58

55.9

069

.35

75.5

171

.93

63.2

3a8

.01

bjor

kelu

nd�

��

61.0

510

0.00

75.8

149

.17

78.3

160

.41

55.5

185

.40

67.2

859

.41

59.4

159

.41

70.0

137

.71

49.0

270

.97

80.9

274

.84

58.9

0a8

.02

zhek

ova

��

�65

.68

100.

0079

.29

45.5

873

.27

56.2

052

.27

82.3

563

.95

55.1

155

.11

55.1

170

.17

37.5

448

.91

59.9

472

.07

63.2

856

.35

a8.0

3st

ambo

rg�

��

56.7

210

0.00

72.3

842

.88

70.4

253

.30

51.1

780

.83

62.6

754

.12

54.1

254

.12

66.2

134

.85

45.6

667

.10

72.3

269

.29

53.8

8a8

.04

chen

��

�58

.26

100.

0073

.63

41.8

163

.28

50.3

650

.10

75.1

960

.13

53.1

953

.19

53.1

964

.59

36.2

746

.46

67.1

966

.52

66.8

552

.32

a9.0

0bj

orke

lund

��

�68

.50

100.

0081

.30

55.2

178

.84

64.9

459

.85

83.7

569

.81

63.1

263

.12

63.1

272

.24

42.7

553

.71

73.3

580

.61

76.4

162

.82

Tabl

e22

:Per

form

ance

ofsy

stem

sin

the

prim

ary

and

supp

lem

enta

ryev

alua

tions

fort

hecl

osed

trac

kfo

rAra

bic.

33

Page 34: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

7.3 Arabic ClosedTable 22 shows the performance for the Arabic

language in greater detail.

7.3.1 Official SettingUnlike English and Chinese, none of the system

was particularly tuned for Arabic. This gives us anunique opportunity to test the performance variationof a mostly statistical, roughly language indepen-dent mechanism. Although, there could possibly bea significant bias that Arabic language brings to themix. The overall performance for Arabic seems tobe about ten points below both English and Chinese.On the mention detection front, most of the systemshave a balanced precision and recall, and the dropin performance seems quite steady. bjorkelund hasa slight edge on fernandes on the MUC, BCUBEDand BLANC metrics, but fernandes has a much largerlead on both the CEAF metrics, putting it on the topin the official score. We haven’t reported the de-velopment set numbers here, but another thing tonote especially for Arabic is that performance onArabic test set is significantly better than on the de-velopment set as pointed out by bjorkelund. Thisis probably because of the smaller size of the train-ing set and therefore a higher relative increment overtraining set. The size of the training set (which isroughly about a third of either Engish or Chinese)also could itself be a factor that explains the lowerperformance, and that Arabic performance mightgain from more data. chen did not use developmentdata for the final models. Using that could have in-creased their score.

7.3.2 Gold Mention BoundariesThe system performance given gold boundaries

followed more of the trend in English than Chinese.There was not much improvement over the primaryNB evaluation. Interestingly, chen uses gold bound-aries for Chinese so well, but does not get any per-formance improvement. This might indicate that thetechnique that helped that system in Chinese doesnot generalize well across languages.

7.3.3 Gold MentionsPerformance given gold mentions seems to be

about ten points higher than in the NB case.bjorkelund does well on BLANC metric than fernan-des even after getting a big hit in recall for mentiondetection. In absence of chang, it seems like fernan-des is the only one that explicitly adds a constraintfor the GM case and gets a perfect mention detec-tion score. All other systems loose significantly onrecall.

7.3.4 Gold Test ParsesFinally, providing gold parses during testing does

not have much of an impact on the scores.

7.4 All Languages OpenTables 24, 25 and 26, give the performance for

the systems that participated in the open track. Notmany systems participated in this track, so there isnot a lot to observe. One thing to note is that chenmodified precise constructs sieve to add named en-tity information in the open track sieve which gavethem a point improvement in performance. Withgold mentions and gold syntax during testing thechen system performance almost approaches an F-score of 80 (79.79)

7.5 Headword-based and Genre specific scoresSince last year’s task showed that there was only

some very local difference in ranking between sys-tems scored using the strict boundaries versus theones using headword based scoring, we did not com-pute the headword based evaluation.

Owing to space constraints, we cannot present adetailed analysis of the variation across genre. How-ever, since genre variation is important to note, wepresent the performance of the highest performingsystem across all the three languages and genres inTable 23. For each language there are three logicalperformance blocks: i) The official, predicted ver-sion, with no provided boundaries is the first block;ii) The supplementary version with gold mentionboundaries is the second block; and iii) The thirdblock shows the performance for the supplementaryversion given gold mentions.

Looking at the Engish performance on the official,closed track, there seems to be a cluster of genre –BC, BN, NW and WB – where the performance is veryclose to a score of 60. Whereas, genres TC, MZ andPT are increasingly better. Surprisingly, a simplisticlook at the individual metrics does indicate a similartrend, except for the CEAFe score for the TC and WBbeing somewhat reversed. It so happens that thesethe two genres — MZ and PT – are professional hu-man translations from a foreign language. As seenearlier, there is not a huge shift in performance whenthe systems are provided with gold mention bound-aries. However, when provided with gold mentionsthere is a big improvement in performance acrossthe board. Especially so with MZ genre for whichthe improvement is more than double (9.5 points)over the improvement in PT genre (3.5 points) withthe most notable improvement (of 5 points) in theCEAFe metric, which also is another indication thatthis metric does a good job of rewarding correctanaphoric mentions.

Looking at the Chinese performance, we see thatthe NW genre does particularly worse than all oth-ers on the official, closed track. The BC genre doessomewhat worse than WB, MZ, and TC all of whichseem to be around the same ballpark, with BN lead-ing the pack. Again, provided gold mention bound-

34

Page 35: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

GenreTrain Test MD COREFERENCE RESOLUTION OfficialSyntax Syntax Mention Qlty. MUC BCUBED CEAFm CEAFe BLANC

A G A G NB GB GM F F1 F2 F F3 FF1+F2+F3

3

ENGLISH

Pivot Text [PT] � � � 89.13 82.49 72.66 68.92 54.47 79.20 69.87Magazine [MZ] � � � 77.70 69.57 77.29 68.88 57.07 81.84 67.98Telephone Conversation [TC] � � � 79.95 76.75 72.31 62.06 43.22 79.24 64.09Weblogs and Newsgroups [WB] � � � 78.21 71.66 68.61 59.42 45.24 76.42 61.84Broadcast News [BN] � � � 74.60 65.15 70.60 60.90 49.52 74.45 61.76Brodcast Conversation [BC] � � � 75.67 67.54 69.14 57.70 44.99 76.74 60.56Newswire [NW] � � � 71.24 62.67 71.01 60.61 47.73 75.40 60.47Pivot Text [PT] � � � 89.50 82.74 72.65 68.98 54.28 79.42 69.89Magazine [MZ] � � � 77.27 68.68 76.53 67.51 55.63 79.72 66.95Telephone Conversation [TC] � � � 81.95 78.18 72.53 63.33 44.32 77.99 65.01Weblogs and Newsgroups [WB] � � � 79.08 72.62 68.94 60.09 45.74 76.46 62.43Broadcast News [BN] � � � 75.10 65.56 69.98 60.47 49.14 74.10 61.56Brodcast Conversation [BC] � � � 75.96 67.64 68.51 57.14 44.85 74.81 60.33Newswire [NW] � � � 70.44 61.63 70.04 59.44 46.57 73.51 59.41Magazine [MZ] � � � 100.00 82.87 83.10 78.02 66.93 87.00 77.63Pivot Text [PT] � � � 100.00 86.20 74.30 71.67 59.43 80.12 73.31Telephone Conversation [TC] � � � 100.00 84.74 75.18 66.29 49.68 77.37 69.87Weblogs and Newsgroups [WB] � � � 100.00 82.38 71.43 66.08 53.28 77.96 69.03Newswire [NW] � � � 100.00 74.00 74.41 67.39 53.28 81.03 67.23Broadcast News [BN] � � � 100.00 74.51 73.31 65.71 52.96 79.15 66.93Brodcast Conversation [BC] � � � 100.00 77.52 71.49 63.73 50.54 79.59 66.52

CHINESE

Broadcast News [BN] � � � 78.02 71.71 78.80 68.93 55.87 83.85 68.79Weblogs and Newsgroups [WB] � � � 79.29 71.30 71.05 60.68 46.81 80.94 63.05Magazine [MZ] � � � 75.34 70.26 72.32 62.63 46.42 81.34 63.00Telephone Conversation [TC] � � � 79.79 72.58 71.14 61.16 43.78 76.82 62.50Brodcast Conversation [BC] � � � 73.80 64.22 67.68 55.38 42.89 72.98 58.26Newswire [NW] � � � 52.38 49.74 67.97 54.82 43.79 75.63 53.83Broadcast News [BN] � � � 78.02 71.71 78.80 68.93 55.87 83.85 68.79Weblogs and Newsgroups [WB] � � � 79.29 71.30 71.05 60.68 46.81 80.94 63.05Magazine [MZ] � � � 75.34 70.26 72.32 62.63 46.42 81.34 63.00Telephone Conversation [TC] � � � 79.79 72.58 71.14 61.16 43.78 76.82 62.50Brodcast Conversation [BC] � � � 73.80 64.22 67.68 55.38 42.89 72.98 58.26Newswire [NW] � � � 52.38 49.74 67.97 54.82 43.79 75.63 53.83Broadcast News [BN] � � � 100.00 81.03 81.34 75.07 62.99 86.18 75.12Telephone Conversation [TC] � � � 100.00 87.31 77.80 71.01 59.44 78.78 74.85Weblogs and Newsgroups [WB] � � � 100.00 80.36 72.46 64.49 51.93 81.10 68.25Magazine [MZ] � � � 100.00 75.18 73.12 65.90 48.81 84.17 65.70Brodcast Conversation [BC] � � � 100.00 76.42 70.01 61.75 49.81 74.14 65.41Newswire [NW] � � � 100.00 51.42 67.83 55.29 43.81 76.53 54.35

ARABIC

Newswire [NW] � � � 64.79 46.46 67.11 55.59 49.08 66.97 54.22Newswire [NW] � � � 65.08 46.26 66.91 54.88 48.53 66.64 53.90Newswire [NW] � � � 100.00 65.48 68.68 62.56 56.32 71.49 63.49

Table 23: Per genre performance for fernandes on the closed, official and supplementary evaluations.

aries there is very little or no change in performance.And, when given the gold mentions the performanceagain shoots up by a significant margin. Here again,we see that the delta improvement in one particu-lar genre TC – is much higher (12 points) than inBN (6 points), and once again the most improvementamong all the metrics happens to be for the CEAFe.Extremely surprising is the fact that the NW genreshows the lowest improvement among all genre. Infact, the performance drops for the BCUBED met-ric. This might have something to do with the factthat Chinese NW genre gets the lowest ITA amongall other (see Table 1), but then the better scoring TCgenre which has the second lowest ITA does con-siderably better (leading by roughly 10 points in theofficial setting, and 20 points in the gold mentionssettings with respect to the TC genre). It could alsobe possible that this has something to do with thefact (and pointed out earlier when discussing chen’sresults) that there is some overlapping mentions thatwere mistakenly included in the release.

As for Arabic, since there was only one NW genre,there is nothing more to be analyzed. We plan toreport more detailed tables and analysis on the taskwebpage.

8 Comparison with CoNLL-2011Table 27 shows the performance of the systems

on CoNLL-2011 test subset which included only theEnglish portion of OntoNotes v4.0. For the Englishsubset, the size of training data in CoNLL-2011was roughly 76% of CoNLL-2012 training data (1Mvs 1.3M words respectively). Although the mod-els used to generate this table were trained on theCoNLL-2012 English data and therefore on about200K more words, it is still a small fraction of thetotal training data. In the past, coreference scoreshave shown to asymptote after a small fraction ofthe total training data. Therefore, the 5% absolutegap between the best performing systems of last yearcan be attributed to algorithmic improvement, andpossibly better rules. Given that a 200K data addi-tion to a 1M word corpus is unlikely to help iden-tify novel rules, and given that bjorkelund reportedadding (about 160K) development data (to the train-ing portion) to train the final model had very lit-tle improvement in performance over using just thetraining data by itself, the possibility that the gainis from algorithmic improvements seems even moreplausible.

It is interesting to note that although the winning

35

Page 36: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

Part

icip

ant

Trai

nTe

stM

EN

TIO

ND

ET

EC

TIO

NC

OR

EF

ER

EN

CE

RE

SOL

UT

ION

Offi

cial

Synt

axSy

ntax

Men

tion

Qlty

.M

UC

BC

UB

ED

CE

AF m

CE

AF e

BL

AN

C

AG

AG

NB

GB

GM

RP

FR

PF 1

RP

F 2R

PF

RP

F 3R

PF

F1+

F2+

F3

3

xion

g�

��

75.2

272

.23

73.6

964

.08

63.5

763

.82

66.4

770

.69

68.5

257

.23

57.2

357

.23

45.0

945

.64

45.3

671

.12

77.9

073

.94

59.2

3xi

ong

��

�77

.85

73.4

475

.58

67.0

365

.27

66.1

468

.03

71.1

469

.55

58.5

858

.58

58.5

845

.60

47.5

346

.54

72.0

378

.58

74.7

860

.74

Tabl

e24

:Per

form

ance

ofsy

stem

sin

the

prim

ary

and

supp

lem

enta

ryev

alua

tions

fort

heop

entr

ack

forE

nglis

h.

Part

icip

ant

Trai

nTe

stM

EN

TIO

ND

ET

EC

TIO

NC

OR

EF

ER

EN

CE

RE

SOL

UT

ION

Offi

cial

Synt

axSy

ntax

Men

tion

Qlty

.M

UC

BC

UB

ED

CE

AF m

CE

AF e

BL

AN

C

AG

AG

NB

GB

GM

RP

FR

PF 1

RP

F 2R

PF

RP

F 3R

PF

F1+

F2+

F3

3

chen

��

�71

.45

73.4

572

.44

62.4

867

.08

64.7

071

.21

78.3

574

.61

63.4

863

.48

63.4

853

.64

49.1

051

.27

75.1

584

.29

78.9

463

.53

yuan

��

�73

.71

63.9

768

.49

63.6

758

.48

60.9

674

.04

72.1

673

.09

60.0

560

.05

60.0

546

.75

51.5

249

.02

74.3

277

.99

76.0

261

.02

xion

g�

��

39.4

767

.55

49.8

230

.00

51.2

037

.83

49.3

777

.45

60.3

042

.71

42.7

142

.71

46.1

028

.12

34.9

357

.98

67.0

860

.59

44.3

5ch

en�

��

83.5

080

.44

81.9

574

.77

74.9

374

.85

77.1

480

.80

78.9

370

.13

70.1

370

.13

58.6

458

.46

58.5

578

.87

86.6

382

.24

70.7

8yu

an�

��

83.9

269

.85

76.2

474

.24

65.1

169

.37

78.6

173

.74

76.1

064

.84

64.8

464

.84

49.4

158

.70

53.6

677

.60

79.4

578

.49

66.3

8xi

ong

��

�42

.78

71.1

053

.42

32.5

753

.89

40.6

049

.50

77.3

860

.37

43.6

643

.66

43.6

647

.04

28.8

335

.75

58.2

467

.37

60.9

045

.57

chen

��

�82

.39

80.1

181

.24

73.5

074

.28

73.8

876

.30

80.4

978

.34

69.4

069

.40

69.4

058

.22

57.3

257

.77

78.4

486

.39

81.8

870

.00

chen

��

�83

.50

80.4

481

.95

74.7

774

.93

74.8

577

.14

80.8

078

.93

70.1

370

.13

70.1

358

.64

58.4

658

.55

78.8

786

.63

82.2

470

.78

chen

��

�84

.80

100.

0091

.77

78.1

293

.19

84.9

975

.04

91.5

982

.50

77.5

077

.50

77.5

084

.03

59.1

769

.44

81.4

690

.73

85.4

178

.98

chen

��

�85

.71

100.

0092

.31

79.0

793

.59

85.7

275

.83

91.9

483

.11

78.2

678

.26

78.2

684

.77

60.4

270

.55

81.7

191

.00

85.6

779

.79

Tabl

e25

:Per

form

ance

ofsy

stem

sin

the

prim

ary

and

supp

lem

enta

ryev

alua

tions

fort

heop

entr

ack

forC

hine

se.

Part

icip

ant

Trai

nTe

stM

EN

TIO

ND

ET

EC

TIO

NC

OR

EF

ER

EN

CE

RE

SOL

UT

ION

Offi

cial

Synt

axSy

ntax

Men

tion

Qlty

.M

UC

BC

UB

ED

CE

AF m

CE

AF e

BL

AN

C

AG

AG

NB

GB

GM

RP

FR

PF 1

RP

F 2R

PF

RP

F 3R

PF

F1+

F2+

F3

3

xion

g�

��

55.0

852

.75

53.8

928

.20

28.4

328

.31

60.8

962

.81

61.8

347

.31

47.3

147

.31

43.1

242

.82

42.9

757

.05

60.7

558

.46

44.3

7xi

ong

��

�57

.55

52.9

855

.17

30.9

930

.10

30.5

462

.16

62.5

562

.36

47.7

347

.73

47.7

342

.48

43.5

943

.03

57.7

861

.39

59.2

045

.31

Tabl

e26

:Per

form

ance

ofsy

stem

sin

the

prim

ary

and

supp

lem

enta

ryev

alua

tions

fort

heop

entr

ack

forA

rabi

c.

Part

icip

ant

Trai

nTe

stM

EN

TIO

ND

ET

EC

TIO

NC

OR

EF

ER

EN

CE

RE

SOL

UT

ION

Offi

cial

Synt

axSy

ntax

Men

tion

Qlty

.M

UC

BC

UB

ED

CE

AF m

CE

AF e

BL

AN

C

AG

AG

NB

GB

GM

RP

FR

PF 1

RP

F 2R

PF

RP

F 3R

PF

F1+

F2+

F3

3

2011

best

��

�75

.07

66.8

170

.70

61.7

657

.53

59.5

768

.40

68.2

368

.31

56.3

756

.37

56.3

743

.41

47.7

545

.48

70.6

376

.21

73.0

257

.79

fern

ande

s�

��

70.1

281

.27

75.2

862

.74

73.4

667

.68

65.0

378

.05

70.9

560

.61

60.6

160

.61

54.1

842

.50

47.6

375

.31

78.5

376

.80

62.0

9m

arts

chat

��

�71

.96

73.5

772

.76

62.8

066

.23

64.4

767

.09

74.5

070

.60

58.8

458

.84

58.8

448

.05

44.5

546

.23

73.1

778

.04

75.3

260

.43

bjor

kelu

nd�

��

70.9

974

.91

72.9

062

.25

67.7

264

.87

65.6

675

.32

70.1

657

.99

57.9

957

.99

48.1

242

.67

45.2

372

.81

77.6

474

.94

60.0

9ch

ang

��

�69

.94

73.3

671

.61

62.5

165

.54

63.9

967

.76

71.9

769

.80

57.4

957

.49

57.4

945

.86

42.8

244

.29

75.3

574

.13

74.7

259

.36

chen

��

�73

.13

69.5

671

.30

60.8

460

.70

60.7

767

.34

71.3

769

.30

57.4

757

.47

57.4

745

.98

46.1

346

.05

70.8

978

.69

74.0

658

.71

stam

borg

��

�72

.83

69.7

871

.27

63.4

761

.16

62.2

969

.10

69.1

969

.14

55.3

755

.37

55.3

741

.89

44.1

342

.98

72.8

675

.67

74.1

658

.14

chun

yang

��

�73

.14

69.3

171

.17

61.7

860

.57

61.1

767

.43

70.3

168

.84

56.6

256

.62

56.6

244

.48

45.7

045

.08

71.3

176

.91

73.7

258

.36

yuan

��

�70

.44

68.8

769

.64

58.9

960

.09

59.5

366

.66

71.3

868

.94

56.9

556

.95

56.9

545

.48

44.3

844

.92

72.2

078

.69

74.9

557

.80

shou

��

�73

.26

69.1

671

.15

61.0

259

.24

60.1

266

.14

68.3

467

.22

54.8

854

.88

54.8

843

.52

45.3

544

.42

69.0

073

.39

70.9

257

.25

xu�

��

58.9

082

.61

68.7

755

.14

73.2

862

.93

60.5

075

.93

67.3

552

.81

52.8

152

.81

48.7

132

.17

38.7

572

.80

70.0

571

.30

56.3

4ur

yupi

na�

��

69.6

466

.80

68.1

958

.92

57.7

258

.31

65.0

568

.26

66.6

252

.25

52.2

552

.25

40.7

141

.83

41.2

668

.07

72.2

569

.89

55.4

0ya

ng�

��

61.5

170

.97

65.9

052

.15

61.8

856

.60

60.3

773

.02

66.0

951

.07

51.0

751

.07

44.6

935

.98

39.8

766

.20

71.2

768

.29

54.1

9xi

nxin

��

�71

.16

50.9

859

.40

52.8

839

.37

45.1

368

.73

56.1

661

.81

44.6

844

.68

44.6

831

.27

43.5

036

.38

65.7

164

.79

65.2

347

.77

zhek

ova

��

�63

.16

65.7

364

.42

50.6

449

.51

50.0

761

.71

56.5

359

.01

43.4

043

.40

43.4

034

.01

35.0

934

.54

65.4

958

.37

60.2

147

.87

li�

��

43.3

984

.81

57.4

036

.45

70.5

348

.06

43.1

181

.37

56.3

641

.88

41.8

841

.88

49.5

222

.69

31.1

262

.31

67.0

264

.12

45.1

8

Tabl

e27

:Per

form

ance

ofal

lthe

syst

ems

onth

eC

oNL

L-2

011

port

ion

(Eng

lish)

ofth

eC

oNL

L-2

012

test

set.

36

Page 37: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

system in the CoNLL-2011 task was a completelyrule-based one, modified version of the same systemused by shou and xiong ranked close to 10. Thisdoes indicate that a hybrid approach has some ad-vantage over a purely rule-based system. Improve-ment seems to be mostly owing to higher precisionin mention detection, MUC, BCUBED, and higher re-call in CEAFe

9 ConclusionsIn this paper we described the anaphoric coref-

erence information and other layers of annotationin the OntoNotes corpus, over three languages —English, Chinese and Arabic — and presented theresults from an evaluation on learning such unre-stricted entities and events in text. The followingrepresents our conclusions on reviewing the results:

• Most top performing systems used a hybrid-approach combining rule-based strategies withmachine learning. Rule-based approach doesseem to bring a system to a close-to-best per-formance region. The most significant advan-tage of the rule-based approach seems to bethat it captures most confident links before con-sidering less confident ones. Discourse infor-mation when present is quite helpful to disam-biguate pronominal mentions. Using informa-tion from appositives and copular constructionsseems beneficial to bridge across various lex-icalized mentions. It is not clear how muchmore can be gained using further strategies.The features for coreference prediction are cer-tainly more complex than for many other lan-guage processing tasks, which makes it morechallenging to generate effective feature com-binations.• Most top performing systems did significant

feature engineering – expecially a heavy use oflexicalized features, which was possible giventhe size of the corpus, and performed featureselection.• It might be possible that the Chinese accuracy

with gold boundaries and mentions is better be-cause the distribution of mentions across thevarious genres is different, and if there are morementions in better scoring genres, then the per-formance would improve overall.• Gold parse during testing does seem to help

quite a bit. Gold boundaries are not of muchsignificance for English and Arabic, but seemto be very useful for Chinese. The reason prob-ably has some roots in the parser performancegap for Chinese.• It does seem that collecting information about

an entity by merging information across thevarious attributes of the mentions that comprise

it can be useful, though not all systems that at-tempted this achieved a benefit, and has to bedone carefully.

• It is noteworthy that systems did not seem toattempt the kind of joint inference that couldmake use of the full potential of various lay-ers available in OntoNotes, but this could wellhave been owing to the limited time availablefor the shared task.

• We had expected to see more attention paid toevent coreference, which is a novel feature inthis data, but again, given the time constraintsand given that events represent only a smallportion of the total, it is not surprising that mostsystems chose not to focus on it.

• Scoring coreference seems to remain a signif-icant challenge. There does not seem to be anobjective way to establish one metric in pref-erence to another in the absence of a specificapplication. On the other hand, the systemrankings do not seem terribly sensitive to theparticular metric chosen. It is interesting thatthe CEAFe metric — which tries to capture thegoodness of the entities in the output — seemmuch lower than the other metric, though it isnot clear whether that means that our systemsare doing a poor job of creating coherent en-tities or whether that metric is just especiallyharsh.

AcknowledgmentsWe gratefully acknowledge the support of the

Defense Advanced Research Projects Agency(DARPA/IPTO) under the GALE program,DARPA/CMO Contract No. HR0011-06-C-0022.We would like to thank all the participants. Withouttheir hard work, patience and perseverance this eval-uation would not have been a success. We wouldalso like to thank the Linguistic Data Consortiumfor making the OntoNotes v5.0 corpus freely andtimely available in training/development/test setsto the participants. Emili Sapena, who graciouslyallowed the use of his scorer implementation. HweeTou Ng and his student Zhi Zhong for trainingthe word sense models and providing outputsfor the training/development and test sets. SlavPetrov and Dan Klein for letting us use their parser.Additionally, we are indebted to Slav for his helpin retraining the parser for Arabic. AlessandroMoschitti and Olga Uryupina have been partiallyfunded by the European Community’s SeventhFramework Programme (FP7/2007-2013) under thegrant number 288024 (LIMOSINE).

37

Page 38: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

ReferencesOlga Babko-Malaya, Ann Bies, Ann Taylor, Szuting Yi,

Martha Palmer, Mitch Marcus, Seth Kulick, and Li-bin Shen. 2006. Issues in synchronizing the Englishtreebank and propbank. In Workshop on Frontiers inLinguistically Annotated Corpora 2006, July.

Amit Bagga and Breck Baldwin. 1998. Algorithms forScoring Coreference Chains. In The First Interna-tional Conference on Language Resources and Eval-uation Workshop on Linguistics Coreference, pages563–566.

Elizabeth Baran and Nianwen Xue. 2011. Singular orPlural? Exploiting Parallel Corpora for Chinese Num-ber Prediction. In Proceedings of Machine TranslationSummit XIII.

Shane Bergsma and Dekang Lin. 2006. Bootstrappingpath-based pronoun resolution. In Proceedings of the21st International Conference on Computational Lin-guistics and 44th Annual Meeting of the Associationfor Computational Linguistics, pages 33–40, Sydney,Australia, July.

Jie Cai and Michael Strube. 2010. Evaluation metricsfor end-to-end coreference resolution systems. In Pro-ceedings of the 11th Annual Meeting of the Special In-terest Group on Discourse and Dialogue, SIGDIAL’10, pages 28–36.

Jie Cai, Eva Mujdricza-Maydt, and Michael Strube.2011a. Unrestricted coreference resolution via globalhypergraph partitioning. In Proceedings of the Fif-teenth Conference on Computational Natural Lan-guage Learning: Shared Task, pages 56–60, Portland,Oregon, USA, June. Association for ComputationalLinguistics.

Shu Cai, David Chiang, and Yoav Goldberg. 2011b.Language-independent parsing with empty elements.In Proceedings of the 49th Annual Meeting of the As-sociation for Computational Linguistics: Human Lan-guage Technologies, pages 212–216, Portland, Ore-gon, USA, June. Association for Computational Lin-guistics.

Wendy W Chapman, Prakash M Nadkarni, LynetteHirschman, Leonard W D’Avolio, Guergana KSavova, and Ozlem Uzuner. 2011. Overcoming bar-riers to NLP for clinical text: the role of shared tasksand the need for additional creative solutions. Journalof American Medical Informatics Association, 18(5),September.

Eugene Charniak and Mark Johnson. 2001. Edit detec-tion and parsing for transcribed speech. In Proceed-ings of the Second Meeting of North American Chapterof the Association of Computational Linguistics, June.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative rerank-ing. In Proceedings of the 43rd Annual Meeting of the

Association for Computational Linguistics (ACL), AnnArbor, MI, June.

Nancy Chinchor and Beth Sundheim. 2003. Messageunderstanding conference (MUC) 6. In LDC2003T13.

Nancy Chinchor. 2001. Message understanding confer-ence (MUC) 7. In LDC2001T02.

Aron Culotta, Michael Wick, Robert Hall, and AndrewMcCallum. 2007. First-order probabilistic models forcoreference resolution. In HLT/NAACL, pages 81–88.

Pascal Denis and Jason Baldridge. 2007. Joint de-termination of anaphoricity and coreference resolu-tion using integer programming. In Proceedings ofHLT/NAACL.

Pascal Denis and Jason Baldridge. 2009. Global jointmodels for coreference resolution and named entityclassification. Procesamiento del Lenguaje Natural,(42):87–96.

George Doddington, Alexis Mitchell, Mark Przybocki,Lance Ramshaw, Stephanie Strassel, and RalphWeischedel. 2004. The automatic content extraction(ACE) program-tasks, data, and evaluation. In Pro-ceedings of LREC.

Charles Fillmore, Christopher Johnson, and Miriam R. L.Petruck. 2003. Background to FrameNet. Interna-tional Journal of Lexicography, 16(3).

Ryan Gabbard. 2010. Null Element Restoration. Ph.D.thesis, University of Pennsylvania.

Aria Haghighi and Dan Klein. 2010. Coreference reso-lution in a modular, entity-centered model. In HumanLanguage Technologies: The 2010 Annual Conferenceof the North American Chapter of the Association forComputational Linguistics, pages 385–393, Los An-geles, California, June.

Jan Hajic, Massimiliano Ciaramita, Richard Johans-son, Daisuke Kawahara, Maria Antonia Martı, LluısMarquez, Adam Meyers, Joakim Nivre, SebastianPado, Jan Stepanek, Pavel Stranak, Mihai Surdeanu,Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependen-cies in multiple languages. In Proceedings of the Thir-teenth Conference on Computational Natural Lan-guage Learning (CoNLL 2009): Shared Task, pages1–18, Boulder, Colorado, June.

Sanda M. Harabagiu, Razvan C. Bunescu, and Steven J.Maiorano. 2001. Text and knowledge mining forcoreference resolution. In NAACL.

Lynette Hirschman and Nancy Chinchor. 1997. Corefer-ence task definition (v3.0, 13 jul 97). In Proceedingsof the Seventh Message Understanding Conference.

Lynette Hirschman, Patricia Robinson, John Burger, andMarc Vilain. 1998. Automating coreference: The roleof annotated training data. In Proceedings of AAAISpring Symposium on Applying Machine Learning toDiscourse Processing.

38

Page 39: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

Eduard Hovy, Mitchell Marcus, Martha Palmer, LanceRamshaw, and Ralph Weischedel. 2006. OntoNotes:The 90% solution. In Proceedings of HLT/NAACL,pages 57–60, New York City, USA, June. Associationfor Computational Linguistics.

Karin Kipper, Anna Korhonen, Neville Ryant, andMartha Palmer. 2000. A large-scale classification ofenglish verbs. Language Resources and Evaluation,42(1):21 – 40.

Heeyoung Lee, Yves Peirsman, Angel Chang, NathanaelChambers, Mihai Surdeanu, and Dan Jurafsky. 2011.Stanford’s multi-pass sieve coreference resolution sys-tem at the conll-2011 shared task. In Proceedingsof the Fifteenth Conference on Computational Natu-ral Language Learning: Shared Task, pages 28–34,Portland, Oregon, USA, June. Association for Com-putational Linguistics.

Xiaoqiang Luo. 2005. On coreference resolution perfor-mance metrics. In Proceedings of Human LanguageTechnology Conference and Conference on EmpiricalMethods in Natural Language Processing, pages 25–32, Vancouver, British Columbia, Canada, October.

Mohamed Maamouri and Ann Bies. 2004. Developingan arabic treebank: Methods, guidelines, procedures,and tools. In Ali Farghaly and Karine Megerdoomian,editors, COLING 2004 Computational Approaches toArabic Script-based Languages, pages 2–9, Geneva,Switzerland, August 28th. COLING.

Mitchell P. Marcus, Beatrice Santorini, and Mary AnnMarcinkiewicz. 1993. Building a large annotated cor-pus of English: The Penn treebank. ComputationalLinguistics, 19(2):313–330, June.

Andrew McCallum and Ben Wellner. 2004. Conditionalmodels of identity uncertainty with application to nouncoreference. In Advances in Neural Information Pro-cessing Systems (NIPS).

Joseph McCarthy and Wendy Lehnert. 1995. Using de-cision trees for coreference resolution. In Proceedingsof the Fourteenth International Conference on Artifi-cial Intelligence, pages 1050–1055.

Thomas S. Morton. 2000. Coreference for nlp applica-tions. In Proceedings of the 38th Annual Meeting ofthe Association for Computational Linguistics, Octo-ber.

Eugene W. Myers. 1986. An O(ND) difference algo-rithm and its variations. Algorithmica, 1(2):251—-266.

Vincent Ng. 2007. Shallow semantics for coreferenceresolution. In Proceedings of the IJCAI.

Vincent Ng. 2010. Supervised noun phrase coreferenceresearch: The first fifteen years. In Proceedings of the48th Annual Meeting of the Association for Compu-tational Linguistics, pages 1396–1411, Uppsala, Swe-den, July.

David S. Pallett. 2002. The role of the National Insti-tute of Standards and Technology in DARPA’s Broad-cast News continuous speech recognition research pro-gram. Speech Communication, 37(1-2), May.

Martha Palmer, Daniel Gildea, and Paul Kingsbury.2005. The Proposition Bank: An annotated corpus ofsemantic roles. Computational Linguistics, 31(1):71–106.

Martha Palmer, Hoa Trang Dang, and Christiane Fell-baum. 2007. Making fine-grained and coarse-grainedsense distinctions, both manually and automatically.

Martha Palmer, Olga Babko-Malaya, Ann Bies, MonaDiab, Mohammed Maamouri, Aous Mansouri, andWajdi Zaghouani. 2008. A pilot arabic propbank. InProceedings of the International Conference on Lan-guage Resources and Evaluation (LREC), Marrakech,Morocco, May 28-30.

Rebecca Passonneau. 2004. Computing reliability forcoreference annotation. In Proceedings of LREC.

Slav Petrov and Dan Klein. 2007. Improved Inferencingfor Unlexicalized Parsing. In Proc of HLT-NAACL.

Massimo Poesio and Ron Artstein. 2005. The reliabilityof anaphoric annotation, reconsidered: Taking ambi-guity into account. In Proceedings of the Workshop onFrontiers in Corpus Annotations II: Pie in the Sky.

Massimo Poesio. 2004. The mate/gnome scheme foranaphoric annotation, revisited. In Proceedings ofSIGDIAL.

Simone Paolo Ponzetto and Massimo Poesio. 2009.State-of-the-art nlp approaches to coreference resolu-tion: Theory and practical recipes. In Tutorial Ab-stracts of ACL-IJCNLP 2009, page 6, Suntec, Singa-pore, August.

Simone Paolo Ponzetto and Michael Strube. 2005. Se-mantic role labeling for coreference resolution. InCompanion Volume of the Proceedings of the 11thMeeting of the European Chapter of the Associa-tion for Computational Linguistics, pages 143–146,Trento, Italy, April.

Simone Paolo Ponzetto and Michael Strube. 2006.Exploiting semantic role labeling, WordNet andWikipedia for coreference resolution. In Proceedingsof the HLT/NAACL, pages 192–199, New York City,N.Y., June.

Sameer Pradhan, Kadri Hacioglu, Valerie Krugler,Wayne Ward, James Martin, and Dan Jurafsky. 2005.Support vector learning for semantic argument classi-fication. Machine Learning Journal, 60(1):11–39.

Sameer Pradhan, Eduard Hovy, Mitchell Marcus, MarthaPalmer, Lance Ramshaw, and Ralph Weischedel.2007a. OntoNotes: A Unified Relational SemanticRepresentation. International Journal of SemanticComputing, 1(4):405–419.

39

Page 40: CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted ... · erence resolution was conducted on the MUC cor-pora, using a decision tree learner, by Soon et al. (2001). Signicant

Sameer Pradhan, Lance Ramshaw, Ralph Weischedel,Jessica MacBride, and Linnea Micciulla. 2007b.Unrestricted Coreference: Indentifying Entities andEvents in OntoNotes. In in Proceedings of theIEEE International Conference on Semantic Comput-ing (ICSC), September 17-19.

Karthik Raghunathan, Heeyoung Lee, Sudarshan Ran-garajan, Nate Chambers, Mihai Surdeanu, Dan Juraf-sky, and Christopher Manning. 2010. A multi-passsieve for coreference resolution. In Proceedings ofthe 2010 Conference on Empirical Methods in Natu-ral Language Processing, pages 492–501, Cambridge,MA, October. Association for Computational Linguis-tics.

Altaf Rahman and Vincent Ng. 2009. Supervised mod-els for coreference resolution. In Proceedings of the2009 Conference on Empirical Methods in NaturalLanguage Processing, pages 968–977, Singapore, Au-gust. Association for Computational Linguistics.

William M. Rand. 1971. Objective criteria for the evalu-ation of clustering methods. Journal of the AmericanStatistical Association, 66(336).

Marta Recasens and Eduard Hovy. 2011. Blanc: Im-plementing the rand index for coreference evaluation.Natural Language Engineering.

Marta Recasens, Lluıs Marquez, Emili Sapena,M. Antonia Martı, Mariona Taule, VeroniqueHoste, Massimo Poesio, and Yannick Versley. 2010.Semeval-2010 task 1: Coreference resolution inmultiple languages. In Proceedings of the 5th Interna-tional Workshop on Semantic Evaluation, pages 1–8,Uppsala, Sweden, July.

Wee Meng Soon, Hwee Tou Ng, and Daniel Chung YongLim. 2001. A machine learning approach to corefer-ence resolution of noun phrase. Computational Lin-guistics, 27(4):521–544.

Veselin Stoyanov, Nathan Gilbert, Claire Cardie, andEllen Riloff. 2009. Conundrums in noun phrase coref-erence resolution: Making sense of the state-of-the-art. In Proceedings of the Joint Conference of the 47thAnnual Meeting of the ACL and the 4th InternationalJoint Conference on Natural Language Processing ofthe AFNLP, pages 656–664, Suntec, Singapore, Au-gust. Association for Computational Linguistics.

Mihai Surdeanu, Richard Johansson, Adam Meyers,Lluıs Marquez, and Joakim Nivre. 2008. The CoNLL2008 shared task on joint parsing of syntactic and se-mantic dependencies. In CoNLL 2008: Proceedingsof the Twelfth Conference on Computational Natu-ral Language Learning, pages 159–177, Manchester,England, August.

Ozlem Uzuner, Andreea Bodnari, Shuying Shen, TylerForbush, John Pestian, and Brett R South. 2012. Eval-uating the state of the art in coreference resolution for

electronic medical records. Journal of American Med-ical Informatics Association, 19(5), September.

Yannick Versley, Simone Paolo Ponzetto, Massimo Poe-sio, Vladimir Eidelman, Alan Jern, Jason Smith,Xiaofeng Yang, and Alessandro Moschitti. 2008.BART: A modular toolkit for coreference resolution.In Proceedings of the 6th International Conference onLanguage Resources and Evaluation, Marrakech, Mo-rocco, May.

Yannick Versley. 2007. Antecedent selection techniquesfor high-recall coreference resolution. In Proceedingsof the 2007 Joint Conference on Empirical Methodsin Natural Language Processing and ComputationalNatural Language Learning (EMNLP-CoNLL).

Marc Vilain, John Burger, John Aberdeen, Dennis Con-nolly, and Lynette Hirschman. 1995. A model theo-retic coreference scoring scheme. In Proceedings ofthe Sixth Message Undersatnding Conference (MUC-6), pages 45–52.

Ralph Weischedel and Ada Brunstein. 2005. BBN pro-noun coreference and entity type corpus LDC catalogno.: LDC2005T33. BBN Technologies.

Ralph Weischedel, Eduard Hovy, Mitchell Marcus,Martha Palmer, Robert Belvin, Sameer Pradhan,Lance Ramshaw, and Nianwen Xue. 2011.OntoNotes: A large training corpus for enhanced pro-cessing. In Joseph Olive, Caitlin Christianson, andJohn McCary, editors, Handbook of Natural LanguageProcessing and Machine Translation: DARPA GlobalAutonomous Language Exploitation. Springer.

Nianwen Xue and Martha Palmer. 2009. Adding seman-tic roles to the Chinese Treebank. Natural LanguageEngineering, 15(1):143–172.

Nianwen Xue, Fei Xia, Fu dong Chiou, and MarthaPalmer. 2005. The Penn Chinese TreeBank: PhraseStructure Annotation of a Large Corpus. Natural Lan-guage Engineering, 11(2):207–238.

Nianwen Xue. 2008. Labeling Chinese Predicateswith Semantic Roles. Computational Linguistics,34(2):225–255.

Yaqin Yang and Nianwen Xue. 2010. Chasing theghost: recovering empty categories in the chinese tree-bank. In Proceedings of Proceedings of the 23rd In-ternational Conference on Computational Linguistics(COLING), Beijing, China.

Wajdi Zaghouani, Mona Diab, Aous Mansouri, SameerPradhan, and Martha Palmer. 2010. The revised ara-bic propbank. In Proceedings of the Fourth LinguisticAnnotation Workshop, pages 222–226, Uppsala, Swe-den, July.

Zhi Zhong and Hwee Tou Ng. 2010. It makes sense:A wide-coverage word sense disambiguation systemfor free text. In Proceedings of the ACL 2010 SystemDemonstrations, pages 78–83, Uppsala, Sweden.

40