EntityPro: Exploiting SVM for Italian Named Entity Recognition
EVALITA 2007
Frascati, September 10th 2007
Roberto Zanoli and Emanuele Pianta
TextPro
A suite of modular NLP tools developed at FBK-irst:
• TokenPro: tokenization
• MorphoPro: morphological analysis
• TagPro: Part-of-Speech tagging
• LemmaPro: lemmatization
• EntityPro: Named Entity recognition
• ChunkPro: phrase chunking
• SentencePro: sentence splitting
Architecture designed to be efficient, scalable and robust.
• Cross-platform: Unix / Linux / Windows / MacOS X
• Multi-lingual models
• All modules integrated and accessible through a unified command line interface
EntityPro’s architecture
We used YamCha, an SVM-based machine learning environment, to build EntityPro, a system that exploits a rich set of linguistic features, such as orthographic features, prefixes and suffixes, and occurrence in proper-noun gazetteers.
[Figure: EntityPro architecture diagram. Components shown: Controller; TagPro; feature extraction (ortho, prefix, suffix, dictionary, collocation bigram); feature selection; YamCha learning and classification; training data, test data, dictionary and models.]
YamCha
• Created as generic, customizable, open source text chunker
• Can be adapted to a lot of other tag-oriented NLP tasks
• Uses state-of-the-art machine learning algorithm (SVM)
• Can redefine:
  context (window-size)
  parsing direction (forward/backward)
  algorithms for the multi-class problem (pairwise / one vs. rest)
• Practical chunking time (1 or 2 sec. per sentence)
• Available as C/C++ library
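To make the pairwise versus one-vs-rest choice concrete, the sketch below counts how many binary SVMs each strategy needs for a given number of classes. The 9-class figure used in the example assumes a B- and an I- tag for each of the four entity types (GPE, LOC, ORG, PER) plus O; that tag inventory is an assumption made for illustration, and the function name is hypothetical rather than part of YamCha.

```python
def n_binary_classifiers(n_classes: int, strategy: str) -> int:
    """Number of binary SVMs needed to cover n_classes (illustrative helper)."""
    if strategy == "pairwise":
        # One classifier for every unordered pair of classes.
        return n_classes * (n_classes - 1) // 2
    if strategy == "one_vs_rest":
        # One classifier per class, each against all the others.
        return n_classes
    raise ValueError(f"unknown strategy: {strategy!r}")


# Assumed tag set: B-/I- for GPE, LOC, ORG, PER, plus O.
n_tags = 2 * 4 + 1
pairwise_count = n_binary_classifiers(n_tags, "pairwise")
one_vs_rest_count = n_binary_classifiers(n_tags, "one_vs_rest")
```

Pairwise training builds more classifiers, but each one sees only the examples of two classes, which is one reason the choice is left configurable.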
Support Vector Machines
Support vector machines are based on the Structural Risk Minimization principle (Vladimir N. Vapnik, 1995) from computational learning theory.
Support vector machines map input vectors to a higher dimensional space where a maximum-margin separating hyperplane is constructed. Two parallel hyperplanes are constructed, one on each side of the hyperplane that separates the data. The separating hyperplane is the one that maximizes the distance between these two parallel hyperplanes.
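The margin idea can be illustrated numerically. The sketch below assumes the usual linear formulation, in which the two parallel hyperplanes are w·x + b = +1 and w·x + b = -1, so the distance between them is 2/||w||. The toy points and the weight vector are invented for the example; nothing here is taken from YamCha or EntityPro.

```python
import math


def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def satisfies_margin(w, b, points, labels):
    """True if every point lies on or outside the margin hyperplane of its class."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1.0
        for x, y in zip(points, labels)
    )


# Toy 2-D data (assumed): negatives on the left, positives on the right.
points = [(1.0, 0.0), (0.0, 1.0), (3.0, 5.0), (4.0, 2.0)]
labels = [-1, -1, +1, +1]
w, b = (1.0, 0.0), -2.0  # separating hyperplane x1 = 2

width = margin_width(w)
separable = satisfies_margin(w, b, points, labels)
```

Among all (w, b) that satisfy the margin constraints, training picks the one with the smallest ||w||, which is the same as the widest margin.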
YamCha: Setting Window Size
Default setting is "F:-2..2:0.. T:-2..-1".
The window setting can be customized
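As a reading aid for the template notation, here is a minimal parser sketch. It assumes that an F entry selects static feature columns at relative token positions (rows -2..2, columns 0 onward in the default) and that a T entry selects previously assigned tags (positions -2..-1). The function names and the returned dictionary layout are hypothetical; this is not YamCha's own parsing code.

```python
import re


def parse_range(text):
    """Parse 'a..b' or the open-ended 'a..' into (start, end_or_None)."""
    match = re.fullmatch(r"(-?\d+)\.\.(-?\d+)?", text)
    if match is None:
        raise ValueError(f"bad range: {text!r}")
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) is not None else None
    return start, end


def parse_window(spec):
    """Split a template such as 'F:-2..2:0.. T:-2..-1' into its ranges."""
    window = {}
    for part in spec.split():
        kind, *fields = part.split(":")
        if kind == "F":
            rows, cols = fields
            window["static_rows"] = parse_range(rows)
            window["static_cols"] = parse_range(cols)
        elif kind == "T":
            (rows,) = fields
            window["dynamic_rows"] = parse_range(rows)
        else:
            raise ValueError(f"unknown entry: {part!r}")
    return window


default_window = parse_window("F:-2..2:0.. T:-2..-1")
```

Under this reading, narrowing the window to one token on each side while looking back three tags would be written "F:-1..1:0.. T:-3..-1".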
Training and Tuning Set
Evalita Development set randomly split into two parts
training: 92,241 tokens
tuning: 40,348 tokens
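A random split of this kind could be sketched as follows. The slides do not say whether the split was made by sentence or by document, nor which random seed was used, so the sentence-level unit, the seed and the function name are all assumptions. The training fraction is derived from the two token counts above.

```python
import random


def split_sentences(sentences, train_fraction, seed=0):
    """Shuffle sentences and cut them into (training, tuning) parts."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Share of tokens that ended up in the training part (about 70%).
train_fraction = 92241 / (92241 + 40348)

# Ten placeholder one-token sentences stand in for the development set.
sentences = [[f"w{i}"] for i in range(10)]
train, tune = split_sentences(sentences, train_fraction)
```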
FEATURES (1/3)
For each running word:
• WORD: the word itself (both unchanged and lower-cased)
  e.g. Casa casa
• POS: the part of speech of the word (as produced by TagPro)
  e.g. Oggi SS (singular noun)
• AFFIX: prefixes/suffixes (1, 2, 3 or 4 chars. at the start/end of the word)
  e.g. Oggi {o, og, ogg, oggi; i, gi, ggi, oggi}
• ORTHOgraphic information (e.g. capitalization, hyphenation)
  e.g. Oggi C (capitalized), oggi L (lowercased)
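The AFFIX and ORTHO features can be sketched in a few lines. The code assumes that affixes are taken from the lower-cased word and that lengths longer than the word are padded with _nil_, as the feature extraction example later in the deck suggests. The "X" code for shapes other than C and L is a hypothetical placeholder, since the slides only show the C and L values.

```python
NIL = "_nil_"


def affixes(word, max_len=4):
    """Prefixes and suffixes of length 1..max_len of the lower-cased word."""
    lower = word.lower()
    prefixes = [lower[:n] if len(lower) >= n else NIL for n in range(1, max_len + 1)]
    suffixes = [lower[-n:] if len(lower) >= n else NIL for n in range(1, max_len + 1)]
    return prefixes, suffixes


def ortho(word):
    """C = capitalized, L = lowercased, X = anything else (X is hypothetical)."""
    if word[:1].isupper():
        return "C"
    if word.islower():
        return "L"
    return "X"


prefixes, suffixes = affixes("Oggi")
```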
FEATURES (2/3)
COLLOCation bigrams (36,000, extracted from Italian newspapers and ranked by mutual information (MI) values)
e.g.
l' O
avvocato O
di O
Rossi O
Carlo B-COL
Taormina I-COL
ha O
...
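Ranking bigrams by MI could look like the sketch below. It assumes that MI here means pointwise mutual information, log2 of p(x, y) / (p(x) p(y)), estimated from raw corpus counts; the slides do not give the exact formula or any frequency threshold, so the min_count parameter and the function name are illustrative.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(tokens, min_count=2):
    """Return [((w1, w2), pmi), ...] sorted by decreasing PMI."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scored = []
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scored.append(((x, y), math.log2(p_xy / (p_x * p_y))))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy corpus (invented) in which one name occurs twice.
tokens = "Carlo Taormina ha detto che Carlo Taormina ha ragione".split()
ranked = rank_bigrams_by_pmi(tokens)
```

Tokens covered by a highly ranked bigram would then receive B-COL / I-COL values, as in the example above.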
FEATURES (3/3): GAZETTeers
• TOWNS: World (main), Italian (comuni, i.e. municipalities) and Trentino's (frazioni, i.e. hamlets) towns (12,000, from various internet sites)
• STOCK-MARKET: Italian and American stock market organizations (5,000, from stock market sites)
• WIKI-GEO: Wikipedia geographical locations (3,200)
• PERSONS: Person proper names or titles (154,000, from the Italian phone book and Wikipedia)
difeso    O    O  O  O
dall'     O    O  O  O
avvocato  O    O  O  TRIG
Mario     O    O  O  B-NAM
De        O    O  O  B-SUR
Murgo     O    O  O  I-SUR
di        O    O  O  O
Vicenza   GPE  O  O  O
...
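One way to produce a gazetteer column like those above is a longest-match lookup over the token sequence. The sketch below is a simplified assumption, not EntityPro's implementation: the entries are toy data, every match is emitted with B-/I- prefixes, and the unprefixed values seen in the example (TRIG, GPE) are not reproduced.

```python
def gazetteer_column(tokens, entries):
    """Label tokens by longest match against a gazetteer.

    `entries` maps a tuple of lower-cased tokens to a label,
    e.g. ("de", "murgo") -> "SUR". Unmatched tokens get "O".
    """
    max_len = max((len(key) for key in entries), default=0)
    lowered = [token.lower() for token in tokens]
    column = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = entries.get(tuple(lowered[i:i + n]))
            if label is not None:
                column[i] = "B-" + label
                for j in range(i + 1, i + n):
                    column[j] = "I-" + label
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return column


# Toy PERSONS gazetteer (invented entries).
persons = {("mario",): "NAM", ("de", "murgo"): "SUR"}
tokens = ["difeso", "dall'", "avvocato", "Mario", "De", "Murgo"]
column = gazetteer_column(tokens, persons)
```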
An Example of Feature Extraction
difeso    VSP  O
dall'     ES   O
avvocato  SS   O
Mario     SPN  B-PER
De        E    I-PER
Murgo     SPN  I-PER
,         XPW  O

difeso    difeso    d  di  dif    dife   o  so  eso    feso   L  N  O  O  O  O      O      VSP  O
dall'     dall'     d  da  dal    dall   '  l'  ll'    all'   L  A  O  O  O  O      O      ES   O
avvocato  avvocato  a  av  avv    avvo   o  to  ato    cato   L  N  O  O  O  TRIG   O      SS   O
Mario     mario     m  ma  mar    mari   o  io  rio    ario   C  N  O  O  O  B-NAM  O      SPN  B-PER
De        de        d  e   _nil_  _nil_  e  de  _nil_  _nil_  C  N  O  O  O  B-SUR  B-COL  E    I-PER
Murgo     murgo     m  mu  mur    murg   o  go  rgo    urgo   C  N  O  O  O  I-SUR  I-COL  SPN  I-PER
Static vs Dynamic Features
STATIC FEATURES
• extracted for the current, previous and following word
• WORD, POS, AFFIX, ORTHO, COLLOC, GAZET
DYNAMIC FEATURES
• decided dynamically during tagging
• tag of the 3 tokens preceding the current token
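The difference between the two kinds of features shows up when the input for one token is assembled. The sketch below assumes left-to-right tagging: static values are read from the rows of the previous, current and following token, while dynamic values are the tags the tagger has already assigned to the three preceding tokens. The feature-name format and the _nil_ padding at sentence boundaries are illustrative choices.

```python
def context_features(rows, tags_so_far, i, static_window=1, dynamic_window=3, pad="_nil_"):
    """Build the feature list for token i.

    rows[j] holds the static feature values of token j;
    tags_so_far holds the tags already predicted for tokens 0..i-1.
    """
    features = []
    # Static part: known before tagging starts, so right context is available.
    for offset in range(-static_window, static_window + 1):
        j = i + offset
        row = rows[j] if 0 <= j < len(rows) else [pad] * len(rows[i])
        features.extend(f"s[{offset}][{k}]={value}" for k, value in enumerate(row))
    # Dynamic part: only earlier decisions exist, so offsets are negative.
    for offset in range(-dynamic_window, 0):
        j = i + offset
        tag = tags_so_far[j] if j >= 0 else pad
        features.append(f"t[{offset}]={tag}")
    return features


rows = [["difeso", "VSP"], ["dall'", "ES"], ["avvocato", "SS"], ["Mario", "SPN"]]
tags_so_far = ["O", "O", "O"]
features = context_features(rows, tags_so_far, 3)
```

Because the dynamic features depend on earlier predictions, a tagging error can propagate to the following tokens, which static features cannot cause.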
Finding the best features
Pr Re F1
baseline 75.28 68.74 71.86
+POS +1.31 +2.78 +2.11
+GAZET +6.09 +7.93 +7.09
+COLLOC +0.37 +0.54 +0.46
+CLUSTER_5-class -0.45 -0.04 -0.23
+POS+GAZET+COLLOC +6.56 +9.14 +7.95
Baseline: WORD (both unchanged and lower-cased), AFFIX, ORTHOgraphic
Window-size: STAT: +2,-2; DYNAMIC: -2
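The F1 column is the harmonic mean of precision and recall, and the "+" rows are differences from the baseline. The short check below recomputes the baseline F1 and the absolute scores of the +POS+GAZET+COLLOC row from the figures in the table; only the variable names are introduced here.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


baseline = (75.28, 68.74)   # Pr, Re of the baseline row
delta_all = (6.56, 9.14)    # Pr, Re deltas of +POS+GAZET+COLLOC

# Absolute precision and recall of the combined configuration.
best = tuple(round(b + d, 2) for b, d in zip(baseline, delta_all))

baseline_f1 = round(f1(*baseline), 2)
best_f1 = round(f1(*best), 2)
```

The combined configuration comes out at Pr 81.84, Re 77.88, F1 79.81, which agrees with the baseline F1 of 71.86 plus the reported delta of 7.95.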
Finding the best window-size
STAT   DYN  Pr     Re     F1
+2,-2  -2   81.84  77.88  79.81
+3,-3  -3   +1.03  -1.17  -0.14
+6,-6  -6   +0.01  -3.14  -1.67
+1,-1  -1   +1.87  +2.46  +2.18
+1,-1  -3   +2.21  +3.04  +2.64
+1,-1       -7.70  -0.72  -4.19
Given the best set of features (F1 = 79.81), we tried to improve the F1 measure by changing the window-size. Rows after the first are differences from the +2,-2 / -2 setting.
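Reading the window-size table as differences from the +2,-2 / -2 row, the absolute scores and the best setting can be recovered as in the sketch below. The numbers are copied from the table; the DYN value of the last row is not legible in the source and is kept as None.

```python
base = {"Pr": 81.84, "Re": 77.88, "F1": 79.81}  # STAT +2,-2, DYN -2

# (STAT, DYN, (delta Pr, delta Re, delta F1))
rows = [
    ("+3,-3", "-3", (1.03, -1.17, -0.14)),
    ("+6,-6", "-6", (0.01, -3.14, -1.67)),
    ("+1,-1", "-1", (1.87, 2.46, 2.18)),
    ("+1,-1", "-3", (2.21, 3.04, 2.64)),
    ("+1,-1", None, (-7.70, -0.72, -4.19)),  # DYN missing in the source
]


def absolute(delta):
    """Add a row of deltas to the base scores."""
    return tuple(round(b + d, 2) for b, d in zip(base.values(), delta))


scores = {(stat, dyn): absolute(delta) for stat, dyn, delta in rows}
best_setting = max(scores, key=lambda key: scores[key][2])  # highest F1
```

The best setting is STAT +1,-1 with DYN -3, at Pr 84.05, Re 80.92, F1 82.45.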
Evaluating the best algorithm: PKI vs. PKE
Pr Re F1 tokens/sec
PKI 84.05 80.92 82.45 1400
PKE 83.22 80.16 81.66 4200
YamCha uses two implementations of SVMs: PKI and PKE.
• Both are faster than the original SVMs.
• PKI produces the same accuracy as the original SVMs.
• PKE approximates the original SVM; it is slightly less accurate but faster.
Feature Contribution to the best configuration
                                                  Pr     Re     F1
Best Configuration                                84.05  80.92  82.45
no POS                                            +0.27  -0.71  -0.24
no GAZET                                          -8.25  -8.40  -8.33
no COLLOC                                         +0.01  -0.13  -0.06
no GAZET, no COLLOC (i.e. no external resources)  -8.26  -8.49  -8.38
no ORTHO                                          -0.96  -3.22  -2.14
no AFFIX                                          -1.30  -2.51  -1.93
The learning curve
[Figure: learning curve]
Test Results
Test-Set Pr Re F1
All 83.41 80.91 82.14
GPE 84.80 86.30 85.54
LOC 77.78 68.85 73.04
ORG 68.84 60.26 64.27
PER 91.62 92.63 92.12
Conclusion (1/2)
A statistical approach to Named Entity Recognition for Italian, based on YamCha/SVMs.
Results confirm that SVMs can deal with a large number of features and that they perform at the state of the art.
Among the features, GAZETteers seem to be the most important: 31% error reduction.
A large context (large window-size values, e.g. +6,-6) causes a significant decrease in recall of about 3 points (data sparseness).
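The slides do not show how the 31% figure was derived. One reading that reproduces it, assumed here rather than stated by the authors, treats 100 - F1 as the error and compares the test-set F1 without external resources (74.07, Appendix A) with the full system (82.14). Note that "without external resources" removes the collocations as well as the gazetteers.

```python
def error_reduction(f1_before, f1_after):
    """Relative reduction of the error, where error = 100 - F1."""
    return (f1_after - f1_before) / (100.0 - f1_before)


# Test-set F1 without external resources vs. the full system (from the slides).
reduction = error_reduction(74.07, 82.14)
reduction_percent = round(100 * reduction)
```

The error drops from 25.93 to 17.86 points, a relative reduction of about 31%.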
Conclusion (2/2)
F1 values for both PER (92.12) and GPE (85.54) appear rather good, comparing well with those obtained in CoNLL-2003 for English.
Recognition of LOCs (F1: 73.04) seems more problematic: we suspect that the number of LOCs in the training data is insufficient for the learning algorithm.
ORGs appear to be highly ambiguous.
Examples
Token        Gold   Prediction
è            O      O
stato        O      O
denunciato   O      O
dai          O      O
carabinieri  B-ORG  O
di           O      O
Vigolo       B-GPE  B-GPE
Vattaro      I-GPE  I-GPE

Token        Gold   Prediction
è            O      O
stato        O      O
fermato      O      O
dai          O      O
carabinieri  O      O
ed           O      O
in           O      O
seguito      O      O
ad           O      O
un           O      O
controllo    O      O
Examples 2
Token        Gold   Prediction
Fontana      B-PER  B-PER
(            O      O
Villazzano   B-ORG  B-GPE
)            O      O
,            O      O
Campo        B-PER  B-PER
(            O      O
Baone        B-ORG  B-GPE
)            O      O
,            O      O
Rao          B-PER  B-PER
(            O      O
Alta         B-ORG  B-ORG
Vallagarina  I-ORG  I-ORG
)            O      O
.            O      O

Token        Gold   Prediction
dovrà        O      O
dare         O      O
a            O      O
via          B-ORG  B-LOC
Segantini    I-ORG  I-LOC
un           O      O
ruolo        O      O
diverso      O      O
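Examples like these can be scored at the entity level. The sketch below assumes exact-match scoring in the CoNLL style, where a predicted entity counts only if both its span and its type agree with the gold annotation; the scorer actually used at EVALITA is not shown in the slides. The tag sequences are those of the Fontana / Villazzano example, and the function names are hypothetical.

```python
def entities(tags):
    """Return the set of (start, end, type) spans encoded by a BIO sequence."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final entity
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    return spans


def precision_recall_f1(gold, pred):
    """Exact-match precision, recall and F1 over entity spans."""
    g, p = entities(gold), entities(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    return precision, recall, (2 * precision * recall / total if total else 0.0)


gold = ["B-PER", "O", "B-ORG", "O", "O", "B-PER", "O", "B-ORG", "O", "O",
        "B-PER", "O", "B-ORG", "I-ORG", "O", "O"]
pred = ["B-PER", "O", "B-GPE", "O", "O", "B-PER", "O", "B-GPE", "O", "O",
        "B-PER", "O", "B-ORG", "I-ORG", "O", "O"]
precision, recall, f1_score = precision_recall_f1(gold, pred)
```

In this sentence four of the six entities match exactly; the two towns predicted as GPE instead of ORG count as errors for both precision and recall.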
Appendix A
Test-Set (without external resources)
     Pr     Re     F1
All  75.79  72.43  74.07
GPE  78.56  76.51  77.53
LOC  81.08  49.18  61.22
ORG  57.09  52.28  54.58
PER  85.71  85.50  85.60
EntityPro
EntityPro is a system for Named Entity Recognition (NER) built on YamCha, which provides the Support Vector Machine (SVM) implementation.
YamCha (Yet Another Multipurpose Chunk Annotator, by Taku Kudo) is a generic, customizable, open source text chunker.
EntityPro can exploit a rich set of linguistic features such as the Part of Speech, orthographic features and proper name gazetteers.
The system is part of TextPro, a suite of NLP tools developed at FBK-irst.