1
http://lcl.uniroma1.it/eurosense EuroSense: Automatic Harvesting of Multilingual Sense Annotations from Parallel Text Alessandro Raganato Claudio Delli Bovi Roberto Navigli What is it? EuroSense is a multilingual sense-annotated resource, automatically built via the joint disambiguation of the Europarl parallel corpus [2] in 21 languages, with almost 123 million sense annotations for over 155 thousand distinct concepts and entities, drawn from the multilingual sense inventory of BabelNet [4]. EuroSense‘s disambiguation pipeline is designed to exploit at best the cross-language complementarities of the parallel corpus, without relying on word alignments against a pivot language. The tools [email protected] [email protected] [email protected] bn:17381127n bn:17381128n bn:09353187n Jose Camacho Collados [email protected] bn:17381131n Babelfy [3] is a state-of-the-art graph-based multilingual disambiguation and entity linking system powered by BabelNet http://babelfy.org Nasari [1] is a language-independent vector representation of concepts and entities from BabelNet and WIkipedia, http://lcl.uniroma1.it/nasari This debate on the state of the Union is taking place in a climate of uncertainty . Ce débat sur l’État de l’Union intervient dans un climat marqué par l’ incertitude . La discussione sullo stato dell’Unione si svolge in un clima segnato dall’ incertezza. Este debate sobre el Estado de la Unión tiene lugar en un clima de inseguridad. Diese Aussprache über die Lage der Union erfolgt in eine m Klima der Unsicherheit. Stage 1: Multilingual Disambiguation Stage 2: Similarity-based Refinement 10M 20M 30M 50k 100k 150k 26M 15M 22M 12M 22M 12M 21M 12M 19M 11M 16M 9M 16M 9M 15M 9M 10M 6M 5M 9M 8M 5M 4M 2M 3M 2M 3M 2M 3M 1M 2M 1M 2M 1M 2M 1M 2M 1M 2M 1M 2M 0.5M 138k 87K 63k 46k 65k 49k 74k 53k 63k 45k 75k 52k 61k 43k 58k 44k 51k 37k 30k 23k 42k 30k 22k 17k 22k 16k 23k 17k 20k 14k 20k 14k 21k 15k 25k 18k 19k 13k 20k 15k 5k 4k EuroSense: Statistics by Language Total Number of Distinct Concepts or Entities: Total Number of Sense Annotations: 215M 123M 248k 156k Stage 1 Stage 2 Stage 2 Stage 1 Multilingual preprocessing (tokenipart-of-speech tagging, lemmatization) with TreeTagger + Babelfy’s preprocessing pipeline; For each sentence, gather all its available translations together in a multilingual text; Multilingual disambiguation using Babelfy’s densest subgraph algorithm in such a way that it favors sense assignments that are consistent across languages. Coherence Score # connections d # connections i i D coherence(d) = bn:00064069n uncertainty, precariousness bn:14473459n State of the Union, State of the Union address bn:00058032n Union, North bn:00019781n climate, mood bn:00019780n climate, clime bn:00062699n place, spot For each sentence, identify a subset D of high-confidence disambiguations (using the coherence score) from stage 1; Take the Nasari vectors associated with the disambiguations in D and compute the centroid of D; Re-disambiguate the mentions associated with the remaining disambiguations with the sense having the closest Nasari vector to the centroid: Similarity Score Experimental Evaluation Intrinsic Evaluation: Annotation Quality 4 languages, 2 human judges per language, 50 random sentences for each configuration (baseline, stage 1, stage 2) EN FR DE ES 76.1 80.3 81.5 59.1 67.9 71.8 80.4 84.6 89.3 67.5 76.7 82.5 Stage 1 Stage 2 Babelfy (baseline) 90 80 70 60 50 Extrinsic Evaluation: Word Sense Disambiguation EuroSense’s English sense annotations as training set for a supervised WSD system: It Makes Sense (IMS) [5] SemEval-2013 SemEval-2015 50 55 60 65 70 63 65.3 65 66.4 67.8 69.3 69.1 69.5 IMS + OMSTI [6] IMS + EuroSense IMS + SemCor MCS References [1] J. Camacho-Collados, M. T. Pilehvar, and R. Navigli. 2016. Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. AIJ, 240:36–64. [2] P. Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. MT summit. vol. 5, pp. 79–86. [3] A. Moro, A. Raganato, and R. Navigli. 2014. Entity Linking meets Word Sense Disambiguation: a Unified Approach. TACL, 2:231–244. [4] R. Navigli and S. P. Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. AIJ, 193:217–250. [5] Z. Zhong and H. T. Ng. 2010. It Makes Sense: A Wide-Coverage Word Sense Disambiguation System for Free Text. ACL: System Demonstrations, pp. 78–83. [6] K. Taghipour and H. T. Ng. 2015. One Million Sense-tagged instances for word sense disambiguation and induction. CoNLL, pp. 338–344. Precision (%) F-Score (%) The authors gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE Contract no. 726487 Claudio Delli Bovi is supported by a Sapienza Research Grant ‘Avvio alla Ricerca 2016’. Jose Camacho-Collados is supported by a Google PhD Fellowship in NLP

EuroSense: Automatic Harvesting of Multilingual Sense Annotations … › ... › PosterACL2017_EuroSense.pdf · 2020-06-04 · Babelfy (baseline) 90 80 70 60 50 Extrinsic Evaluation:

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: EuroSense: Automatic Harvesting of Multilingual Sense Annotations … › ... › PosterACL2017_EuroSense.pdf · 2020-06-04 · Babelfy (baseline) 90 80 70 60 50 Extrinsic Evaluation:

http://lcl.uniroma1.it/eurosense

EuroSense: Automatic Harvesting of Multilingual Sense Annotations from Parallel Text

Alessandro Raganato

Claudio Delli Bovi

Roberto Navigli

What is it?● EuroSense is a multilingual sense-annotated resource,

automatically built via the joint disambiguation of the Europarl parallel corpus [2] in 21 languages, with almost 123 million sense annotations for over 155 thousand distinct concepts and entities, drawn from the multilingual sense inventory of BabelNet [4].

● EuroSense‘s disambiguation pipeline is designed to exploit at best the cross-language complementarities of the parallel corpus, without relying on word alignments against a pivot language.

The tools

[email protected]

[email protected]

[email protected]

bn:17381127n

bn:17381128n

bn:09353187n

Jose Camacho [email protected] bn:17381131n

● Babelfy [3] is a state-of-the-art graph-based multilingual disambiguation and entity linking system powered by BabelNet http://babelfy.org

● Nasari [1] is a language-independent vector representation of concepts and entities from BabelNet and WIkipedia, http://lcl.uniroma1.it/nasari

This debate on the state of the Union is taking place in a climate of uncertainty.

Ce débat sur l’État de l’Union intervient dans un climat marqué par l’incertitude.

La discussione sullo stato dell’Unione si svolge in un clima segnato dall’incertezza.

Este debate sobre el Estado de la Unión tiene lugar en un clima de inseguridad.

Diese Aussprache über die Lage der Union erfolgt in einem Klima der Unsicherheit.

Stage 1: Multilingual

Disambiguation

Stage 2: Similarity-based

Refinement

10M

20M

30M

50k

100k

150k

26M

15M

22M

12M

22M

12M

21M

12M

19M

11M

16M

9M

16M

9M

15M

9M10M

6M 5M

9M 8M

5M 4M2M 3M

2M3M

2M 3M1M

2M1M

2M1M

2M 1M 2M 1M 2M 1M 2M0.5M

138k

87K

63k

46k

65k

49k

74k

53k63k

45k

75k

52k61k

43k

58k

44k51k

37k30k

23k

42k30k

22k17k

22k16k

23k17k 20k

14k20k

14k21k

15k25k

18k 19k13k

20k15k

5k 4k

EuroSense: Statistics by Language

Total Number of Distinct Concepts or Entities:

Total Number of Sense Annotations:

215M 123M

248k 156k

Stage 1

Stage 2

Stage 2

Stage 1

● Multilingual preprocessing (tokenipart-of-speech tagging, lemmatization) with TreeTagger + Babelfy’s preprocessing pipeline;

● For each sentence, gather all its available translations together in a multilingual text;

● Multilingual disambiguation using Babelfy’s densest subgraph algorithm in such a way that it favors sense assignments that are consistent across languages.

Coherence Score

# connectionsd

# connectionsii ∈ D

coherence(d) =

bn:00064069nuncertainty, precariousness

bn:14473459nState of the Union, State of the Union address

bn:00058032nUnion, North

bn:00019781nclimate, mood

bn:00019780nclimate, clime

bn:00062699nplace, spot

● For each sentence, identify a subset D of high-confidence disambiguations (using the coherence score) from stage 1;

● Take the Nasari vectors associated with the disambiguations in D and compute the centroid of D;

● Re-disambiguate the mentions associated with the remaining disambiguations with the sense having the closest Nasari vector to the centroid:

Similarity Score

Experimental Evaluation

● Intrinsic Evaluation: Annotation Quality4 languages, 2 human judges per language, 50 random sentences for each configuration (baseline, stage 1, stage 2)

EN FR DE ES

76.1

80.381.5

59.1

67.9

71.8

80.4

84.6

89.3

67.5

76.7

82.5Stage 1

Stage 2

Babelfy (baseline)

90

80

70

60

50

● Extrinsic Evaluation: Word Sense Disambiguation EuroSense’s English sense annotations as training set for a supervised WSD system: It Makes Sense (IMS) [5]

SemEval-2013 SemEval-201550

55

60

65

70

63

65.3 6566.4

67.8

69.3 69.1 69.5

IMS + OMSTI [6]

IMS + EuroSense

IMS + SemCor

MCS

References[1] J. Camacho-Collados, M. T. Pilehvar, and R. Navigli. 2016. Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. AIJ, 240:36–64.[2] P. Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. MT summit. vol. 5, pp. 79–86.[3] A. Moro, A. Raganato, and R. Navigli. 2014. Entity Linking meets Word Sense Disambiguation: a Unified Approach. TACL, 2:231–244.[4] R. Navigli and S. P. Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. AIJ, 193:217–250.[5] Z. Zhong and H. T. Ng. 2010. It Makes Sense: A Wide-Coverage Word Sense Disambiguation System for Free Text. ACL: System Demonstrations, pp. 78–83.[6] K. Taghipour and H. T. Ng. 2015. One Million Sense-tagged instances for word sense disambiguation and induction. CoNLL, pp. 338–344.

Prec

ision

(%)

F-Sc

ore

(%)

The authors gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE Contract no. 726487

Claudio Delli Bovi is supported by a Sapienza Research Grant ‘Avvio alla Ricerca 2016’. Jose Camacho-Collados is supported by a Google PhD Fellowship in NLP