Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning


Marieke van Erp, Giuseppe Rizzo, Raphaël Troncy

@giusepperizzo

http://twitter.com/giusepperizzo

May 13, 2013, Making Sense of Microposts (#MSM2013)

NERD-ML @ MSM'13

http://www.eurecom.fr/


Preprocessing

➢ Dataset is converted to CoNLL IOB format

➢ Applied 10-fold cross-validation

➢ Chunked the set of tweets into 50 KB parts to comply with the NERD file size limit
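
The three preprocessing steps above can be sketched as follows. This is only an illustration, not the NERD-ML code: the function names, the span representation (start token, exclusive end token, class), the IOB2 variant of the tagging scheme and the round-robin fold assignment are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical span type: (start token index, end index exclusive, class)
Span = Tuple[int, int, str]


def to_iob(tokens: List[str], spans: List[Span]) -> List[Tuple[str, str]]:
    """Tag every token with B-<class>, I-<class> or O (IOB2 variant assumed)."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return list(zip(tokens, tags))


def chunk_by_bytes(tweets: Iterable[str],
                   max_bytes: int = 50 * 1024) -> Iterator[List[str]]:
    """Group whole tweets into parts whose UTF-8 size stays within max_bytes.

    A single tweet larger than max_bytes would still form its own part.
    """
    part: List[str] = []
    size = 0
    for tweet in tweets:
        n = len(tweet.encode("utf-8")) + 1  # +1 for a newline separator
        if part and size + n > max_bytes:
            yield part
            part, size = [], 0
        part.append(tweet)
        size += n
    if part:
        yield part


def k_fold_indices(n_items: int,
                   k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; item i is tested in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test
```

Splitting on tweet boundaries rather than at a fixed byte offset keeps each micropost intact, so entity offsets returned for a part can be traced back to a single tweet.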



NERD extractors

➢ Retrieves named entities from 10 extractors (Web APIs)

➢ Harmonizes the classification according to the NERD Ontology v0.5 http://nerd.eurecom.fr/ontology

➢ 75 entity classes mapped to 4 MSM'13 classes
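
The class mapping can be pictured as a lookup table. The sketch below is a hypothetical excerpt: the three entries are illustrative rather than copied from the actual alignment of 75 classes, and falling back to MISC for unlisted classes is an assumption, not a documented rule.

```python
# Hypothetical excerpt of a NERD-class -> MSM'13-class alignment.
# The real mapping covers 75 classes; these entries are illustrative only.
NERD_TO_MSM = {
    "Person": "PER",
    "Location": "LOC",
    "Organization": "ORG",
}


def to_msm_class(nerd_class: str, default: str = "MISC") -> str:
    """Map a NERD class name to one of the 4 MSM'13 classes.

    Assumption: classes without an explicit entry fall back to MISC.
    """
    return NERD_TO_MSM.get(nerd_class, default)
```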

http://nerd.eurecom.fr




Ritter et al. (2011)

➢ Off-the-shelf tool tailored to a Twitter stream based on:

– LabelledLDA (+CRF)

– Textual features (POS, capitalization, suffix, etc.)

– Freebase gazetteers (names of PER, ORG, LOC)

➢ 10 entity classes mapped to 4 classes

Ritter, A., Clark, S., Mausam, Etzioni, O.: Named Entity Recognition in Tweets: An Experimental Study. In: Empirical Methods in Natural Language Processing (EMNLP’11) (2011)



Stanford CRF

➢ Re-trained on the MSM'13 corpora

➢ Parameters based on english.conll.4class.distsim.crf.ser.gz properties file provided with the Stanford distribution

➢ Baseline of our approach

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In: 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05) (2005)



Textual features

➢ POS

➢ Capitalisation information

– initial capital

– all capitalized

– proportion of token capitals

➢ Prefix (first three letters of the token)

➢ Suffix (last three letters of the token)

➢ Whether the token is at the beginning or at the end of the micropost
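
The surface features listed above can be computed per token roughly as follows. This is a minimal sketch: the function name and the feature keys are made up for the example, and the POS feature is left out because it needs an external tagger.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Surface features for tokens[i] within one tokenized micropost."""
    tok = tokens[i]
    letters = [c for c in tok if c.isalpha()]
    n_upper = sum(1 for c in tok if c.isupper())
    return {
        "initial_capital": tok[:1].isupper(),
        # True only if the token has letters and all of them are upper case
        "all_capitalized": bool(letters) and all(c.isupper() for c in letters),
        "capital_proportion": n_upper / len(tok) if tok else 0.0,
        "prefix": tok[:3],   # first three characters
        "suffix": tok[-3:],  # last three characters
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }
```

For example, `token_features(["NASA", "launches", "Rocket"], 0)` marks the token as all-capitalized with prefix `NAS` and suffix `ASA`.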

Ritter, A., Clark, S., Mausam, Etzioni, O.: Named Entity Recognition in Tweets: An Experimental Study. In: Empirical Methods in Natural Language Processing (EMNLP’11) (2011)



ML settings

Run01: 7 textual features (POS, initial capital, proportion of capitals, prefix, suffix, end/start token); 0 extractors; ML=k-NN, k=1, Euclidean distance

Run02: 0 textual features; 12 extractors (AlchemyAPI, DBpedia Spotlight, Extractiv, Lupedia, OpenCalais, Saplo, Yahoo, Textrazor, Wikimeta, Zemanta, Stanford NER, Ritter et al.); ML=SVM, polynomial kernel, SMO

Run03: 4 textual features (POS, initial capital, suffix, proportion of capitals); 8 extractors (AlchemyAPI, DBpedia Spotlight, Extractiv, OpenCalais, Textrazor, Wikimeta, Stanford NER, Ritter et al.); ML=SVM, polynomial kernel, SMO
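
To make the Run01 classifier concrete, a 1-nearest-neighbour predictor with Euclidean distance can be sketched as below. It assumes features have already been encoded as numeric vectors (categorical features such as prefix or POS would need an encoding step, not shown), and the function names are illustrative. The SVM with SMO training used in Run02 and Run03 is not reproduced here.

```python
import math
from typing import List, Sequence, Tuple

# A training example: (numeric feature vector, class label)
Example = Tuple[Sequence[float], str]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_1nn(train: List[Example], query: Sequence[float]) -> str:
    """Return the label of the training example closest to the query (k = 1)."""
    _, label = min(train, key=lambda ex: euclidean(ex[0], query))
    return label
```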



Precision – MSM'13 training set, 10-fold cross-validation



Recall – MSM'13 training set, 10-fold cross-validation



F1 – MSM'13 training set, 10-fold cross-validation
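
For reference, the three measures can be computed from sets of gold and predicted annotations as sketched below. Counting a prediction as correct only when span and class both match exactly is an assumption of this example; the official challenge scorer may count matches differently.

```python
from typing import Hashable, Set, Tuple


def precision_recall_f1(gold: Set[Hashable],
                        predicted: Set[Hashable]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over two sets of annotations.

    Each annotation is any hashable item, e.g. (start, end, class).
    An annotation is a true positive only if it appears in both sets.
    """
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Under cross-validation, the function would be applied to each fold's held-out part and the per-fold scores averaged.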



Lessons learned

➢ MISC class is ambiguously defined

➢ 8.1% of the named entities from the training data occur in the test data

➢ Best run: Run03, which uses a subset of the extractors and some of the textual features

➢ For the next challenge: what about entity linking?



Thanks for your time and attention

http://www.slideshare.net/giusepperizzo

NERD-ML: http://github.com/giusepperizzo/nerdml

