Topic modeling and WSD on the Ancora corpus

Ruben Izquierdo. LDA & WSD. SEPLN2015, Alicante.

Topic Modeling and WSD on the Ancora Corpus

Ruben Izquierdo, Marten Postma, Piek Vossen


Outline
1. Starting Point
2. Motivation
3. Our Approach
4. Evaluation Framework
5. Experiments and Results
6. Conclusions


Starting point
• “Understanding languages by machines” project
• Starts from the results of DutchSemCor (WSD)
• Analyse the real problems of WSD
• Understand the WSD task
• Word, meaning, context




Still WSD?
• Word Sense Disambiguation is still unsolved
• Used in high-level applications
• Recently some unsupervised approaches and SemEval tasks
  • BabelNet, Babelfy…
• Several reasons and problems


WSD problems I
• Context is not considered properly
  • Most are/were supervised approaches
  • Moving to unsupervised, graph-based…
• WSD as a black box
  • The larger the number of features, the better the performance?
  • The best and newest machine learning algorithm
• WSD is seen as only one problem
  • All words and cases treated in the same way


WSD problems II
• Error analysis of Senseval/SemEval systems [Postma et al., 2014]
  • Propagation errors (monosemous)



WSD problems II
• Most Frequent Sense (MFS) bias
  • Supervised systems are skewed towards the MFS
• Error analysis on WSD and Senseval/SemEval
  • Performance on MFS cases is good
  • Very poor performance on non-MFS cases
  • Systems assign the MFS in almost every case
• Senseval-2
  • 799 cases where the correct sense is not the MFS
  • 84% of the systems still assign the MFS




Main idea
• WSD considered as two different problems
  • When the MFS applies
    • More general usages
    • Larger contexts ??
  • Rest of the senses
    • More concrete usages
    • Shorter contexts ??
• Specialized classifiers for each case
  • Different features, parameters, contexts…
• Evaluation for Spanish
  • Sense-annotated corpus Ancora


Our approach
• TRAINING. Use Topic Modeling (LDA) to induce word expert classifiers
  • For the Most Frequent Sense (BINARY)
    • Topics for the MFS case
    • Topics for non-MFS cases
  • For the rest of senses, non-MFS (MULTICLASS)
    • Topics for every sense
• CLASSIFICATION. Apply the 2 classifiers in cascade to decide the sense in every case


Training


Classification
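The cascade described in the approach (a binary MFS decision first, then a multiclass decision among the remaining senses) can be sketched as follows. This is only an illustration of the control flow: in the talk both steps are LDA-based classifiers, whereas here they are replaced by hypothetical keyword rules (`toy_is_mfs`, `toy_pick_non_mfs`), and the sense labels are invented for the example.

```python
from typing import Callable, Sequence

Context = Sequence[str]  # a context is simply a list of tokens


def cascade_classify(
    context: Context,
    mfs_sense: str,
    is_mfs: Callable[[Context], bool],
    pick_non_mfs_sense: Callable[[Context], str],
) -> str:
    """Step 1 (binary): does the MFS apply? Step 2 (multiclass): which other sense?"""
    if is_mfs(context):
        return mfs_sense
    return pick_non_mfs_sense(context)


# Hypothetical stand-ins for the two classifiers (NOT the LDA models of the talk).
def toy_is_mfs(context: Context) -> bool:
    return "dinero" in context


def toy_pick_non_mfs(context: Context) -> str:
    return "banco.seat" if "parque" in context else "banco.shoal"


if __name__ == "__main__":
    for tokens in (["dinero", "cuenta"], ["parque", "sentado"], ["peces", "mar"]):
        print(tokens, cascade_classify(tokens, "banco.financial", toy_is_mfs, toy_pick_non_mfs))
```

Keeping the two steps as separate callables mirrors the idea that each case may need its own features, parameters and context size.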





Evaluation framework
• Ancora corpus
  • News articles, Spanish part, 500K words, sense annotated (nouns)
  • Converted to NAF format
• 3-fold cross-validation
  • Keeping sense distribution
• 7119 unique lemmas annotated with nominal senses



• Of the 7119 unique lemmas annotated:
  • 4907 are monosemous (69%)
  • 2212 are polysemous (31%)
    • 589 with at least 3 instances per sense (among the annotated instances)
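A minimal sketch of how the lemma selection and the distribution-preserving 3-fold split could be implemented. The function names are hypothetical, and the round-robin dealing is just one simple way to keep the sense distribution similar across folds; the actual procedure used for the experiments is not described on the slides.

```python
from collections import Counter, defaultdict


def has_min_instances_per_sense(senses, minimum=3):
    """True for a polysemous lemma whose every sense has at least `minimum` instances."""
    counts = Counter(senses)
    return len(counts) > 1 and min(counts.values()) >= minimum


def stratified_folds(senses, n_folds=3):
    """Split instance indices into n_folds, dealing each sense out round-robin."""
    by_sense = defaultdict(list)
    for index, sense in enumerate(senses):
        by_sense[sense].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_sense.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


if __name__ == "__main__":
    # Toy annotations for one lemma: six instances of sense s1, three of s2.
    senses = ["s1"] * 6 + ["s2"] * 3
    print(has_min_instances_per_sense(senses))
    for fold in stratified_folds(senses):
        print(Counter(senses[i] for i in fold))
```

With at least 3 instances per sense and 3 folds, every fold can contain every sense, which is presumably why that threshold was chosen.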



[Figure: Number of lemmas vs. polysemy. x-axis: polysemy, 2 to 12; y-axis: number of lemmas, 0 to 1400]


Baseline results
For the 589 selected lemmas

Baseline      Accuracy
Random        40.10
MFS overall   67.68
MFS folded    68.63
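The three baselines can be sketched for a single lemma as below. The interpretation is an assumption: "Random" is taken as the expected accuracy of a uniform choice among the observed senses, "MFS overall" as the MFS computed on all instances, and "MFS folded" as the MFS computed on the training folds and scored on the held-out fold. The data is a toy example, not Ancora.

```python
from collections import Counter


def expected_random_accuracy(gold):
    """Expected accuracy of picking uniformly among the senses observed for the lemma."""
    return 1.0 / len(set(gold))


def mfs_overall_accuracy(gold):
    """Accuracy of always answering the most frequent sense of the whole data."""
    mfs = Counter(gold).most_common(1)[0][0]
    return sum(sense == mfs for sense in gold) / len(gold)


def mfs_folded_accuracy(folds):
    """MFS is taken from the training folds and scored on each held-out fold."""
    correct = total = 0
    for held_out, test in enumerate(folds):
        train = [s for i, fold in enumerate(folds) if i != held_out for s in fold]
        mfs = Counter(train).most_common(1)[0][0]
        correct += sum(sense == mfs for sense in test)
        total += len(test)
    return correct / total


if __name__ == "__main__":
    folds = [["a", "a", "b"], ["a", "a", "b"], ["a", "a", "b"]]  # toy gold senses
    gold = [sense for fold in folds for sense in fold]
    print(expected_random_accuracy(gold), mfs_overall_accuracy(gold), mfs_folded_accuracy(folds))
```

The corpus-level figures would then be obtained by aggregating over the instances of all selected lemmas.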




Experimentation
• Configuration of our cascade classifiers
  • Only one step with the senseLDA classifier
  • 2 steps, mfsLDA with perfect performance + senseLDA
  • 2 steps, mfsLDA and senseLDA both induced automatically
• LDA parameters (python gensim library)
  • Context size (number of sentences)
  • Number of topics for LDA
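To make the two parameters concrete, the sketch below builds the token context for a target sentence and enumerates a parameter grid. It assumes a symmetric window of `n_sentences` on each side, with 0 meaning the target sentence only; the slides do not say how the window is placed. The grid values (0, 3, 50 sentences; 3, 10, 100 topics) are the ones that appear in the results tables for the senseLDA classifier. The topic model itself is left out.

```python
from itertools import product


def context_window(sentences, target_index, n_sentences):
    """Tokens of the target sentence plus n_sentences on each side (assumed symmetric)."""
    start = max(0, target_index - n_sentences)
    end = min(len(sentences), target_index + n_sentences + 1)
    return [token for sentence in sentences[start:end] for token in sentence]


CONTEXT_SIZES = [0, 3, 50]   # number of sentences
TOPIC_COUNTS = [3, 10, 100]  # number of LDA topics
PARAMETER_GRID = list(product(CONTEXT_SIZES, TOPIC_COUNTS))

if __name__ == "__main__":
    # Toy document of four tokenized sentences; the target word is in sentence 1.
    document = [["el", "país"], ["el", "equipo", "ganó"], ["esta", "semana"], ["fin"]]
    for size, topics in PARAMETER_GRID:
        tokens = context_window(document, 1, size)
        print(size, topics, len(tokens))
```

Each token list would then be turned into a bag of words and passed to the topic model with the corresponding number of topics.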


Results I: one-step classification

[Diagram labels: Instance/Example; Sense LDA (all senses); Word Sense]

Sentences   Topics   Accuracy
MFS baseline         68.63
0           3        67.54
0           10       65.56
0           100      58.34
3           3        66.30
3           10       64.62
3           100      60.07
50          3        66.04
50          10       63.42
50          100      59.06

• MFS baseline not reached
• Most informative clues are in small contexts
• More topics, lower performance


Results II: two steps, MFS classifier with 100% performance

[Diagram labels: Instance/Example; MFS (100% accuracy); Word Sense]

Sentences   Topics   Accuracy
MFS baseline         68.63
0           3        92.48
0           10       92.12
0           100      90.50
3           3        92.45
3           10       92.11
3           100      91.60
50          3        92.41
50          10       92.12
50          100      91.43

• Extremely high figures
• Good performance of the senseLDA classifier (when no MFS)
• Similar behaviour w.r.t. #sents and #topics


Results III: two steps, MFS classifier with #S=5

[Diagram labels: Instance/Example; MFS (s5); Word Sense]

Sents   Topics   Acc. (MFS T100)   Acc. (MFS T1000)
MFS baseline     68.63
0       3        74.53             66.73
0       10       74.00             66.41
0       100      72.61             64.91
3       3        74.30             66.61
3       10       73.87             66.36
3       100      73.39             65.76
50      3        74.26             66.48
50      10       73.90             66.24
50      100      73.53             65.75

• MFS s5 t100
• Smaller contexts for non-MFS cases (3, 50 included by 0)
• 3 topics is the best


Results IV: two steps, MFS classifier with #S=50

[Diagram labels: Instance/Example; MFS (s50); Word Sense]

Sents   Topics   Acc. (MFS T100)   Acc. (MFS T1000)
MFS baseline     68.63
0       3        73.34             67.15
0       10       72.92             66.76
0       100      71.43             65.13
3       3        73.21             67.02
3       10       72.88             66.60
3       100      72.40             66.24
50      3        73.21             66.95
50      10       72.83             66.58
50      100      72.15             66.20

• Similar behaviour compared to MFS_s5
• Slightly lower results


Lemma comparison

Lemma        MFS (68.63)   LDA (74.53)   Variation   Annotations
año          89.15         91.19           2.04      1275
país         72.29         83.55          11.26       695
presidente   70.31         73.94           3.63       690
partido      55.87         64.48           8.61       641
equipo       98.32         98.88           0.56       539
mes          54.29         80.00          25.71       315
hora         61.39         56.11          -5.28       305
caso         61.05         91.58          30.53       286
mundo        47.31         40.14          -7.17       279
semana       85.06         92.34           7.28       263

Most frequent lemmas
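The Variation column is the LDA accuracy minus the MFS accuracy for each lemma. As a small worked check, the snippet below recomputes it from the figures on the slide and lists the lemmas where the two-step approach falls below the MFS baseline; the variable names are only illustrative.

```python
# (lemma, MFS accuracy, LDA accuracy, number of annotations), copied from the slide
ROWS = [
    ("año", 89.15, 91.19, 1275),
    ("país", 72.29, 83.55, 695),
    ("presidente", 70.31, 73.94, 690),
    ("partido", 55.87, 64.48, 641),
    ("equipo", 98.32, 98.88, 539),
    ("mes", 54.29, 80.00, 315),
    ("hora", 61.39, 56.11, 305),
    ("caso", 61.05, 91.58, 286),
    ("mundo", 47.31, 40.14, 279),
    ("semana", 85.06, 92.34, 263),
]

# Variation = LDA accuracy - MFS accuracy, rounded to two decimals as on the slide
variation = {lemma: round(lda - mfs, 2) for lemma, mfs, lda, _ in ROWS}
worse_than_mfs = [lemma for lemma, delta in variation.items() if delta < 0]

if __name__ == "__main__":
    for lemma, delta in variation.items():
        print(f"{lemma:12s}{delta:+7.2f}")
    print("Below the MFS baseline:", worse_than_mfs)
```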





Conclusions
• Simple approach based on LDA for WSD in Spanish
• Two-step classification approach for WSD improves the results for Spanish (about 6 points over the MFS baseline: 74.53 vs. 68.63)
• Different nature of both cases
  • MFS: contexts of 5 sentences, 100 topics
  • Non-MFS: context of the local sentence only, 3 topics
• All code and data publicly available on GitHub (group policy)
  • http://github.com/rubenIzquierdo/lda_wsd







THANK YOU
Ruben Izquierdo, Marten Postma, Piek Vossen

email: [email protected]
http://github.com/rubenIzquierdo/lda_wsd
http://rubenizquierdobevia.com
