
Don’t compare Apples to Oranges - Extending GERBIL for a fine grained NEL Evaluation

Jörg Waitelonis, Henrik Jürges, Harald Sack

Hasso-Plattner-Institute for IT-Systems Engineering, University of Potsdam

Semantics 2016, Leipzig, Germany, September 12-15th, 2016

Agenda

1. NEL and NEL evaluation

2. Dataset properties and evaluation drawbacks

3. Extending GERBIL

● Building conditional datasets

● Measure dataset characteristics

4. Results

5. Demonstration

6. Summary & Future work

Named Entity Linking (NEL)


… Armstrong …

Named Entity Linking (NEL), Principle


“Armstrong landed on the moon.”

Candidates for “Armstrong”: dbr:Neil_Armstrong, dbr:Lance_Armstrong, dbr:Louis_Armstrong, …

Candidates for “moon”: dbr:Moon, dbr:Lunar, …

[Figure: each entity mention with its surface form is linked to the correct entity among the candidates.]

Techniques: String Distance, Link Analysis, Vector Space, Fuzzy String Matching, Conditional Random Fields, Random Forest, RankSVM, Learning to Rank, Surface Aggregation, Word Embeddings, Context Similarity Matching

1. Tokenize the text

2. Find candidates in the KB

3. Score the candidates with a magic algorithm and select the best one
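A minimal sketch of these three steps; the candidate index, popularity scores, and context hints below are toy assumptions, not how any real annotator scores:

```python
# Minimal NEL pipeline sketch: tokenize, look up candidates, score, link.
# The candidate index, popularity values, and context hints are toy data.
CANDIDATES = {
    "armstrong": ["dbr:Neil_Armstrong", "dbr:Lance_Armstrong", "dbr:Louis_Armstrong"],
    "moon": ["dbr:Moon", "dbr:Lunar"],
}
POPULARITY = {"dbr:Neil_Armstrong": 0.9, "dbr:Lance_Armstrong": 0.5,
              "dbr:Louis_Armstrong": 0.6, "dbr:Moon": 0.8, "dbr:Lunar": 0.2}
CONTEXT_HINTS = {"dbr:Neil_Armstrong": {"moon", "landed"}, "dbr:Moon": {"landed"}}

def link(text):
    tokens = [t.strip(".,").lower() for t in text.split()]   # 1. tokenize
    links = {}
    for tok in tokens:
        cands = CANDIDATES.get(tok, [])                      # 2. find candidates
        if cands:
            # 3. score: popularity plus simple context overlap, keep the best
            links[tok] = max(cands, key=lambda e: POPULARITY[e]
                             + len(CONTEXT_HINTS.get(e, set()) & set(tokens)))
    return links

print(link("Armstrong landed on the moon."))
# {'armstrong': 'dbr:Neil_Armstrong', 'moon': 'dbr:Moon'}
```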

Example annotators: KEA, Wikifier

● The algorithm only approximates the correct entities

● Need for verification and testing

● A dataset consists of:

■ Documents (String/Sentences)

■ Annotations (ground truth)

Named Entity Linking, Evaluation


ACE2004

AIDA/CoNLL

DBpedia Spotlight

IITB

KORE50

MSNBC

Micropost2014

N3-RSS-500

WES2015

N3-Reuters-128

● Traditional measures are:

■ Precision: the fraction of returned annotations that are correct (how exact the annotator works)

■ Recall: the fraction of ground truth annotations that are found (how complete the results are)

■ F1-measure: the harmonic mean of precision and recall (see the sketch below)

■ And more, cf. Rizzo et al. [1]
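As a sketch, with the gold standard and the annotator output represented as sets of (mention, entity) pairs, micro-averaged over a single toy document:

```python
def precision_recall_f1(predicted, gold):
    """Micro P/R/F1 over sets of (mention, entity) annotation pairs."""
    tp = len(predicted & gold)                     # correctly linked annotations
    p = tp / len(predicted) if predicted else 0.0  # how exact the annotator works
    r = tp / len(gold) if gold else 0.0            # how complete the results are
    f1 = 2 * p * r / (p + r) if p + r else 0.0     # harmonic mean
    return p, r, f1

gold = {("Armstrong", "dbr:Neil_Armstrong"), ("moon", "dbr:Moon")}
predicted = {("Armstrong", "dbr:Louis_Armstrong"), ("moon", "dbr:Moon")}
print(precision_recall_f1(predicted, gold))  # (0.5, 0.5, 0.5)
```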

● GERBIL - a general entity annotator benchmarking framework (AKSW Leipzig), cf. Röder et al. [2]

● Used for testing/optimizing/benchmarking annotators

● Neat web interface

● 13 Annotators / 20 datasets

● The F-measure is too coarse for a detailed evaluation

● Developers need dataset insights

Named Entity Linking, Benchmarking


● Size of a dataset

● Number of annotations/documents/words

● What types of entities are used? E.g. persons, places, events, ….

● Are there documents without annotations? E.g. Microposts 2014

● How popular are the entities? E.g. PageRank, indegree

● How ambiguous are the entities and surface forms?

● How diverse are the entities and surface forms?

● ….

Cf. van Erp et al. [3]

Properties of Datasets


● How do the dataset characteristics influence the evaluation results?

● How does the popularity of entities influence the evaluation results?

● How can a general dataset be used for domain-specific NEL tools?

● How can datasets be compared? Is there something like a general difficulty?

● Limited comparability between benchmark results

● Penalization of good annotators with inappropriate datasets

Cf. van Erp et al. [3]

Research Questions and Drawbacks


● Approach for a solution:

● Adjustable filter system for GERBIL

● Expose dataset characteristics

● Datasets and annotators added at runtime are also included

● Visualize the results

Extending GERBIL


Extending GERBIL, Conditional Datasets


[Figure: The original dataset is split into type- and popularity-specific sub-datasets by filtering entities on rdf:type and on a PageRank threshold PR(e) > t. The documents are passed to the annotator, and each specific sub-dataset is evaluated against the corresponding annotator results to produce benchmark results.]
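A sketch of this filtering idea, with toy stand-ins for the DBpedia rdf:type and PageRank lookups; GERBIL's actual filter implementation differs:

```python
# Toy stand-ins for DBpedia rdf:type and PageRank lookups.
RDF_TYPES = {"dbr:Neil_Armstrong": {"dbo:Person", "dbo:Astronaut"},
             "dbr:Moon": {"dbo:CelestialBody"}}
PAGERANK = {"dbr:Neil_Armstrong": 0.004, "dbr:Moon": 0.009}

def conditional_subset(annotations, wanted_type=None, t=None):
    """Keep only annotations whose entity has the wanted rdf:type
    and a popularity PR(e) > t."""
    subset = []
    for mention, entity in annotations:
        if wanted_type and wanted_type not in RDF_TYPES.get(entity, set()):
            continue
        if t is not None and PAGERANK.get(entity, 0.0) <= t:
            continue
        subset.append((mention, entity))
    return subset

dataset = [("Armstrong", "dbr:Neil_Armstrong"), ("moon", "dbr:Moon")]
print(conditional_subset(dataset, wanted_type="dbo:Person"))  # persons only
print(conditional_subset(dataset, t=0.005))  # only entities with PR(e) > 0.005
```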

Results, Types


Results, Popularity


Extending GERBIL, Not Annotated Documents

● Not annotated documents: the relative amount of empty documents within a dataset (a sketch follows below)

● Only relevant if the annotator searches for entity mentions by itself
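A minimal sketch of this measure, assuming each document is represented by its list of gold annotations:

```python
def not_annotated(documents):
    """Relative amount of documents whose gold annotation set is empty."""
    empty = sum(1 for anns in documents if not anns)
    return empty / len(documents) if documents else 0.0

# Three documents, one without any gold annotation:
print(not_annotated([["dbr:Neil_Armstrong"], [], ["dbr:Moon"]]))  # 0.333...
```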


Extending GERBIL, Density


● Density: the ratio between the number of annotations and the number of words in a document (a sketch follows below)

● Only relevant if the annotator searches for entity mentions by itself
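A minimal sketch, assuming annotation lists and word counts are given per document:

```python
def density(documents):
    """Average ratio of annotations to words per document."""
    ratios = [len(anns) / words for anns, words in documents if words]
    return sum(ratios) / len(ratios) if ratios else 0.0

# Two documents: 2 annotations in 5 words, 1 annotation in 10 words.
print(density([(["Armstrong", "moon"], 5), (["Tegel"], 10)]))  # (0.4 + 0.1) / 2 = 0.25
```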


Extending GERBIL, Likelihood of Confusion

● Likelihood of Confusion (Level of Ambiguity)

● True values are unknown, as no exhaustive collections of surface forms exist

● Gives a rough overview of how difficult the disambiguation is (see the sketch after the figure)


[Figure: Surface forms (“Tegel”, “TXL”, “Bruce”) mapped to entities (Airport Tegel “Otto Lilienthal”, Bruce Lee, Bruce Willis): “Tegel” and “TXL” are synonyms for the same airport, while the ambiguous “Bruce” may refer to either Bruce Lee or Bruce Willis.]
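A sketch of how such homonym and synonym counts can be collected from a dataset's (surface form, entity) annotation pairs:

```python
from collections import defaultdict

def confusion_counts(annotations):
    """Count entities per surface form (homonymy) and surface forms
    per entity (synonymy) from (surface_form, entity) pairs."""
    entities_per_sf = defaultdict(set)
    sfs_per_entity = defaultdict(set)
    for sf, entity in annotations:
        entities_per_sf[sf].add(entity)
        sfs_per_entity[entity].add(sf)
    homonymy = {sf: len(es) for sf, es in entities_per_sf.items()}
    synonymy = {e: len(sfs) for e, sfs in sfs_per_entity.items()}
    return homonymy, synonymy

pairs = [("Bruce", "dbr:Bruce_Lee"), ("Bruce", "dbr:Bruce_Willis"),
         ("Tegel", "dbr:Berlin_Tegel_Airport"), ("TXL", "dbr:Berlin_Tegel_Airport")]
homonymy, synonymy = confusion_counts(pairs)
print(homonymy["Bruce"])                     # 2 entities share "Bruce"
print(synonymy["dbr:Berlin_Tegel_Airport"])  # 2 surface forms for the airport
```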

Results, Likelihood of Confusion


[Chart: Entities / Surface Forms. A high red bar indicates that an entity has a high number of homonyms; a high blue bar indicates that a surface form has a high number of synonyms.]

Extending GERBIL, Dominance of Entities


[Figure: In the vocabulary, the entity dbr:Bruce_Willis carries the surface forms “Bruce”, “Bruci”, “Bruce Willis”, and “Bruce Walter Willis”, but only a subset of them occurs in the test data.]

dominance(e) = e(t) / e(v)

● Expresses the relation between the surface forms of an entity used in the test data, e(t), and all its surface forms in the vocabulary, e(v)

● True values are unknown

● High rates prevent overfitting

● Prevents repetition of surface forms

Extending GERBIL, Dominance of Surface Forms


[Figure: In the vocabulary, the surface form “Angelina” can refer to dbr:Irene_Angelina, dbr:Angelina_Jordan, and dbr:Angelina_Jolie, but only a subset of these entities is linked in the test data.]

dominance(s) = s(t) / s(v)

● Expresses the relation between the entities a surface form is linked to in the test data, s(t), and all its candidate entities in the vocabulary, s(v) (a sketch covering both dominance measures follows below)

● True values are unknown

● High rates prevent overfitting

● Indicates how context-dependent the disambiguation of a surface form is
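Both dominance measures are ratios of what the test data actually uses against what the vocabulary knows; a minimal sketch under that reading of e(t)/e(v) and s(t)/s(v), reusing the examples from the figures above:

```python
def dominance(used, known):
    """Ratio |used in test data| / |known in vocabulary|.
    Entity dominance: used vs. known surface forms of an entity.
    Surface form dominance: used vs. known candidate entities of a form."""
    return len(set(used)) / len(set(known)) if known else 0.0

# dominance(e): dbr:Bruce_Willis occurs only as "Bruce" in the test data,
# although the vocabulary also knows three further surface forms.
print(dominance({"Bruce"},
                {"Bruce", "Bruci", "Bruce Willis", "Bruce Walter Willis"}))  # 0.25

# dominance(s): "Angelina" is only linked to dbr:Angelina_Jolie in the test
# data, although it could also refer to two further candidate entities.
print(dominance({"dbr:Angelina_Jolie"},
                {"dbr:Angelina_Jolie", "dbr:Irene_Angelina", "dbr:Angelina_Jordan"}))  # 0.33...
```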

Results, Dominance


[Chart: Dominance of surface forms / Dominance of entities. A blue bar indicates that a variety of surface forms is used for an entity; a red bar indicates how context-dependent the disambiguation of a surface form is.]

Demo

● http://gerbil.s16a.org/

● https://github.com/santifa/gerbil/


■ Summary:

□ Implemented a domain-specific filter system

□ Measured dataset characteristics

□ Annotator results are nearly the same on entities of different popularity

□ Enable specific analyses and optimization of annotators

□ Enable users to select the tool that performs best for a specific domain

■ Future work:

□ Keep up with GERBIL development, increase performance

□ More measurements, e.g. max_recall

□ Dataset remixing, i.e. assembling new customized datasets

– E.g. unpopular companies

Summary & Future work


[1] Giuseppe Rizzo, Amparo Elizabeth Cano Basave, Bianca Pereira, and Andrea Varga. Making Sense of Microposts (#Microposts2015) Named Entity rEcognition and Linking (NEEL) Challenge. In 5th Workshop on Making Sense of Microposts (#Microposts2015), pages 44–53. CEUR-WS.org, 2015.

[2] Michael Röder, Ricardo Usbeck, and Axel-Cyrille Ngonga Ngomo. GERBIL's New Stunts: Semantic Annotation Benchmarking Improved. Technical report, Leipzig University, 2016.

[3] Marieke van Erp, Pablo Mendes, Heiko Paulheim, Filip Ilievski, Julien Plu, Giuseppe Rizzo, and Jörg Waitelonis. Evaluating Entity Linking: An Analysis of Current Benchmark Datasets and a Roadmap for Doing a Better Job. In Proceedings of the 10th Language Resources and Evaluation Conference (LREC), Portorož, Slovenia, 2016.

References


Questions?


Thank you for your attention!