Msm2013challenge

DESCRIPTION

Our submission for the Making Sense of Microposts IE Challenge at the World Wide Web Conference 2013.

Page 1: Msm2013challenge

ELIS – Multimedia Lab

Fréderic Godin, Pedro Debevere, Erik Mannens, Wesley De Neve and Rik Van de Walle

MSM2013 IE Challenge: Leveraging Existing Tools for Named Entity Recognition in Microposts

Multimedia Lab, Ghent University – iMinds, Belgium

Image and Video Systems Lab, KAIST, South Korea

Page 2: Msm2013challenge

Introduction: The challenge

Existing tools for NER are developed for news corpora

Develop NER tools for microposts

4 entity types: Person, Location, Organisation, and Miscellaneous (film/movie, entertainment award event, political event, programming language, sporting event, and TV show)

Page 3: Msm2013challenge

How do current NER tools perform? (1)

Rizzo et al. evaluated the performance of AlchemyAPI, DBpedia Spotlight, Evri, Extractiv, OpenCalais, and Zemanta

on 5 TED talks, 1000 news articles, and 217 conference abstracts.

Could we do the same evaluation for microposts?

Page 4: Msm2013challenge

How do current NER tools perform? (2)

Preprocessing: convert bracket placeholder tokens back to literal brackets
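As an aside, a minimal sketch of this preprocessing step in Python. The placeholder tokens below (-LRB-, -RRB-, etc., Penn Treebank style) are an assumption, not confirmed by the slides; the mapping should be adapted to the tokens actually used in the challenge data.

```python
# Restore literal brackets before querying the NER services.
# NOTE: the placeholder tokens below (Penn Treebank style) are assumed;
# substitute whatever bracket tokens the challenge data actually contains.
BRACKET_TOKENS = {
    "-LRB-": "(", "-RRB-": ")",
    "-LSB-": "[", "-RSB-": "]",
    "-LCB-": "{", "-RCB-": "}",
}

def restore_brackets(text: str) -> str:
    for token, bracket in BRACKET_TOKENS.items():
        text = text.replace(token, bracket)
    return text

print(restore_brackets("RT -LRB-via @user-RRB- great talk!"))
# RT (via @user) great talk!
```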

Note: values can differ depending on the ontology mapping used!

F1 values:

                 PER      LOC      ORG      MISC
AlchemyAPI       78.20%   74.60%   54.40%   10.20%
Spotlight (0.2)  57.60%   46.40%   24.40%    5.00%
Spotlight (0.5)  32.90%    3.70%    6.50%    7.30%
OpenCalais       69.30%   73.10%   55.80%   31.40%
Zemanta          70.40%   64.30%   48.10%   29.30%

Page 5: Msm2013challenge

How do current NER tools perform? (3)

AlchemyAPI: performs poorly at recognizing exotic names, small villages, buildings, and organizations

Zemanta: same shortcomings as AlchemyAPI, and additionally relies on capitalisation

OpenCalais: poor at recognizing small villages, buildings, and organizations, but does recognize big events!

DBpedia Spotlight: returns multiple ‘possible’ entities

What if we combine the power of all 4 services?

Page 6: Msm2013challenge

Combining existing services (1)

Apply machine learning to a feature vector built from the outputs of the different services (see the sketch below)

[Architecture diagram] AlchemyAPI, DBpedia Spotlight, OpenCalais, and Zemanta each contribute a confidence level for PER, LOC, ORG, and MISC (mapped from each service-specific entity type), giving 16 features in total; a Random Forest classifier maps this feature vector to the final PER, LOC, ORG, or MISC label.
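For concreteness, a minimal sketch of this combination step, assuming scikit-learn's RandomForestClassifier. The 4 services x 4 entity types = 16-feature layout follows the diagram above; the exact feature encoding, the toy data, and the helper names are illustrative, not the submission's actual implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

SERVICES = ["AlchemyAPI", "Spotlight", "OpenCalais", "Zemanta"]
TYPES = ["PER", "LOC", "ORG", "MISC"]

def to_features(scores):
    """Map {service: {entity_type: confidence}} to a 16-dim vector
    (4 services x 4 entity types); a missing output becomes 0.0."""
    return [scores.get(s, {}).get(t, 0.0) for s in SERVICES for t in TYPES]

# Toy training data: per-service confidences for two candidate entities.
train_scores = [
    {"AlchemyAPI": {"PER": 0.9}, "Zemanta": {"PER": 0.7}},
    {"OpenCalais": {"LOC": 0.8}, "Spotlight": {"LOC": 0.6}},
]
train_labels = ["PER", "LOC"]  # gold entity types

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(np.array([to_features(s) for s in train_scores]), train_labels)

# Classify a new candidate entity from its per-service confidences.
candidate = {"AlchemyAPI": {"PER": 0.8}, "Spotlight": {"PER": 0.5}}
print(clf.predict([to_features(candidate)]))
```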

Page 7: Msm2013challenge

Combining existing services (2)

Evaluation per entity type (combined approach, varying the DBpedia Spotlight confidence threshold)

                 PER      LOC      ORG      MISC
Spotlight (0.2)  82.20%   75.70%   60.40%   47.40%
Spotlight (0.5)  81.60%   74.30%   59.40%   40.50%

Noisier input data (Spotlight threshold 0.2 instead of 0.5) gives better results

(final results on the test set are not included; they are part of the challenge)
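For reference, per-entity-type F1 scores like the ones above can be computed with scikit-learn's f1_score; the gold labels and predictions below are made up for illustration and are not the challenge data.

```python
from sklearn.metrics import f1_score

TYPES = ["PER", "LOC", "ORG", "MISC"]
# Made-up gold labels and classifier predictions, one per candidate entity.
gold = ["PER", "LOC", "ORG", "MISC", "PER", "LOC"]
pred = ["PER", "LOC", "ORG", "PER",  "PER", "ORG"]

# average=None returns one F1 score per label, in the order of TYPES.
for entity_type, f1 in zip(TYPES, f1_score(gold, pred, labels=TYPES, average=None)):
    print(f"{entity_type}: {f1:.2%}")
```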

Page 8: Msm2013challenge

Conclusions

Current NER tools perform well in most cases

Shortcomings: incorrect use of capital letters; abbreviations of organisations; small villages, counties, and buildings

Combining the output of several services yields good results

Page 9: Msm2013challenge

#Questions @frederic_godin #MMLab