Talk "Learning with the Web: Spotting Named Entities on the Intersection of NERD and Machine Learning", given at #MSM'13 (WWW'13), Rio de Janeiro, Brazil. Microposts shared on social platforms instantaneously report facts, opinions or emotions. These posts often mention entities, but the entities mentioned change continuously with what is currently trending. In such a scenario, recognising named entities is a challenging task for which off-the-shelf approaches are not well equipped. We propose NERD-ML, an approach that combines the benefits of a crowd of Web entity extractors with the linguistic strengths of a machine learning classifier.
Learning with the Web: Spotting Named Entities on the Intersection of NERD and Machine Learning
Marieke van Erp, Giuseppe Rizzo, Raphaël Troncy
@giusepperizzo
May 13, 2013, Making Sense of Microposts (#MSM2013)
Preprocessing
➢ Dataset is converted to CoNLL IOB format
➢ Applied 10-fold cross-validation
➢ Chunked the set of tweets into 50 KB parts in order to comply with NERD file size limitations
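The three preprocessing steps above can be sketched roughly as follows. This is an illustration only, not the NERD-ML code: the function names, the `(start, end, class)` span layout, and the round-robin fold assignment are assumptions, and the IOB variant shown is the one where every entity starts with a `B-` tag.

```python
from typing import Iterable, Iterator, List, Tuple


def to_iob(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Tag tokens with IOB labels; spans are (start, end_exclusive, class)."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def chunk_by_size(tweets: Iterable[str], max_bytes: int = 50 * 1024) -> List[List[str]]:
    """Group tweets so each chunk's UTF-8 size (one tweet per line) fits max_bytes.

    A single tweet larger than max_bytes still ends up in its own chunk.
    """
    chunks, current, size = [], [], 0
    for tweet in tweets:
        n = len(tweet.encode("utf-8")) + 1  # +1 for the newline separator
        if current and size + n > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(tweet)
        size += n
    if current:
        chunks.append(current)
    return chunks


def k_folds(items: List, k: int = 10) -> Iterator[Tuple[List, List]]:
    """Yield (train, test) pairs; item i belongs to test fold i % k."""
    for fold in range(k):
        test = [x for i, x in enumerate(items) if i % k == fold]
        train = [x for i, x in enumerate(items) if i % k != fold]
        yield train, test
```

For example, `to_iob(["Visiting", "Rio", "de", "Janeiro"], [(1, 4, "LOC")])` tags the last three tokens as one LOC entity.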
NERD extractors
➢ Retrieves named entities from 10 extractors (Web APIs)
➢ Harmonizes the classification according to the NERD Ontology v0.5 http://nerd.eurecom.fr/ontology
➢ 75 entity classes mapped to 4 MSM'13 classes
http://nerd.eurecom.fr
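Harmonizing extractor output to the four challenge classes amounts to a lookup table. The sketch below is a hypothetical excerpt: the actual 75-class mapping is not shown in the slides, so the entries, the `nerd:` prefix notation, and the fallback to `O` for unmapped classes are assumptions made for illustration.

```python
# Hypothetical excerpt of a class mapping; the real table covers 75 classes.
NERD_TO_MSM = {
    "nerd:Person": "PER",
    "nerd:Location": "LOC",
    "nerd:Organization": "ORG",
    "nerd:Product": "MISC",
    "nerd:Event": "MISC",
}


def map_to_msm(nerd_class: str, default: str = "O") -> str:
    """Map an ontology class to a challenge class (assumed fallback: 'O')."""
    return NERD_TO_MSM.get(nerd_class, default)
```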
Ritter et al. (2011)
➢ Off-the-shelf tool tailored to a Twitter stream based on:
– LabelledLDA (+CRF)
– Textual features (POS, capitalization, suffix, etc.)
– Freebase gazetteers (names of PER, ORG, LOC)
➢ 10 entity classes mapped to 4 classes
Ritter, A., Clark, S., Mausam, Etzioni, O.: Named Entity Recognition in Tweets: An Experimental Study. In: Empirical Methods in Natural Language Processing (EMNLP’11) (2011)
Stanford CRF
➢ Re-trained on the MSM'13 corpora
➢ Parameters based on english.conll.4class.distsim.crf.ser.gz properties file provided with the Stanford distribution
➢ Baseline of our approach
Finkel, J.R., Grenager, T., Manning, C.: Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In: 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05) (2005)
Textual features
➢ POS
➢ Capitalisation information
– initial capital
– all capitalized
– proportion of token capitals
➢ Prefix (first three letters of the token)
➢ Suffix (last three letters of the token)
➢ Whether the token is at the beginning or at the end of the micropost
Ritter, A., Clark, S., Mausam, Etzioni, O.: Named Entity Recognition in Tweets: An Experimental Study. In: Empirical Methods in Natural Language Processing (EMNLP’11) (2011)
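Most of the textual features listed above can be computed directly from the token string and its position. The function below is a minimal sketch under assumed names (`token_features` and its dictionary keys are not from the talk); the POS tag is left out because it requires an external tagger.

```python
from typing import Dict, List, Union


def token_features(tokens: List[str], i: int) -> Dict[str, Union[bool, float, str]]:
    """Surface features for tokens[i]; POS is omitted (needs a tagger)."""
    tok = tokens[i]
    n_upper = sum(ch.isupper() for ch in tok)
    return {
        "initial_capital": tok[:1].isupper(),
        "all_capitalized": tok.isupper(),
        "capital_proportion": n_upper / len(tok) if tok else 0.0,
        "prefix": tok[:3],   # first three letters
        "suffix": tok[-3:],  # last three letters
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }
```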
ML settings
Run01: 7 textual features (POS, initial capital, proportion of capitals, prefix, suffix, end/start token); 0 extractors; ML=k-NN, k=1, Euclidean distance
Run02: 0 textual features; 12 extractors (AlchemyAPI, DBpedia Spotlight, Extractiv, Lupedia, OpenCalais, Saplo, Yahoo, Textrazor, Wikimeta, Zemanta, Stanford NER, Ritter et al.); ML=SVM, polynomial kernel, SMO
Run03: 4 textual features (POS, initial capital, suffix, proportion of capitals); 8 extractors (AlchemyAPI, DBpedia Spotlight, Extractiv, OpenCalais, Textrazor, Wikimeta, Stanford NER, Ritter et al.); ML=SVM, polynomial kernel, SMO
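To make the Run01 setting concrete, here is a pure-Python sketch of a nearest-neighbour classifier with k=1 and Euclidean distance. It does not reproduce the toolkit used for the submitted runs, and it assumes the features have already been encoded as numeric vectors; the function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_1nn(train: List[Tuple[Sequence[float], str]], query: Sequence[float]) -> str:
    """Return the label of the training vector closest to query (k=1)."""
    _, label = min(train, key=lambda pair: euclidean(pair[0], query))
    return label
```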
Precision: MSM'13 training set, 10-fold cross-validation
Recall: MSM'13 training set, 10-fold cross-validation
F1: MSM'13 training set, 10-fold cross-validation
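For reference, the three measures can be computed per class from aligned gold and predicted labels. The sketch below scores at the token level, which is an assumption; the slides do not say whether the challenge scoring was per token or per entity.

```python
from typing import List, Tuple


def precision_recall_f1(gold: List[str], predicted: List[str], label: str) -> Tuple[float, float, float]:
    """Per-class precision, recall and F1 over aligned label sequences."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```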
Lessons learned
➢ MISC class is ambiguously defined
➢ 8.1% of the named entities from the training data occur in the test data
➢ Best run is Run03: not all extractors, plus some textual features
➢ For the next challenge: what about entity linking?
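An overlap figure like the 8.1% above can be obtained by comparing the sets of entity strings in the two splits. How the reported number was actually computed is not stated in the slides; the sketch assumes distinct surface forms compared case-insensitively, and the function name is illustrative.

```python
from typing import Iterable


def entity_overlap(train_entities: Iterable[str], test_entities: Iterable[str]) -> float:
    """Fraction of distinct training entities that also appear in the test set."""
    train = {e.lower() for e in train_entities}
    test = {e.lower() for e in test_entities}
    return len(train & test) / len(train) if train else 0.0
```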
Thanks for your time and attention
http://www.slideshare.net/giusepperizzo
NERD-ML: http://github.com/giusepperizzo/nerdml