
Idiom Token Classification using Sentential Distributed Semantics

Giancarlo D. Salton, Robert J. Ross, John D. Kelleher

Applied Intelligence Research Centre, School of Computing

Dublin, 28 September 2016


Outline

Idioms

Distributed Representations

“Per-expression” classification

“General” classification

Conclusions

Future Work on Idiom Token Classification

Idiom Classification on Machine Translation Pipeline


Idioms

- Idioms are multiword expressions (MWEs)

- Their meaning is non-compositional

- There is no linguistic agreement on the set of characteristics defining idioms


Idiomatic and Literal Usages

- Literally...

- Actually...

- How can we distinguish between a literal and an idiomatic usage?

⇒ Idiom token classification


Previous Work

- Previous work used “per-expression” models
  – a different set of features for each expression

- In general, these features are not reusable
  – i.e., a model is trained for each particular expression

- In our opinion, the state of the art is Peng et al. (2014)
  – also “per-expression” classification
  – topic models
  – up to 5 paragraphs of context!


Per-expression classifiers

- Per-expression classifiers are
  – expensive: idiom samples are rare
  – time-consuming: they require feature engineering


General Classifiers?

- Can we find a common set of features?

- Can we train a general classifier?

hold+horses vs. break+ice vs. spill+beans


Distributed Representations of Words

- Word2vec (Mikolov et al., 2013)
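As a rough illustration (not from the talk), distributed word representations of this kind can be trained with the gensim library; the toy corpus and hyperparameters below are assumptions for demonstration only.

```python
# Toy sketch of learning distributed word representations with gensim's
# word2vec. The corpus and settings are illustrative, not the talk's.
from gensim.models import Word2Vec

corpus = [
    ["he", "finally", "spilled", "the", "beans", "about", "the", "deal"],
    ["she", "spilled", "coffee", "beans", "all", "over", "the", "floor"],
    # ... in practice, millions of tokenised sentences
]

model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)
vec = model.wv["beans"]                        # a 100-d distributed representation
print(model.wv.most_similar("beans", topn=3))  # nearby words in the vector space
```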


Skip-thought Vectors (or Sent2Vec) (Kiros et al., 2015)

- Encoder/decoder framework
  – the encoder learns to encode information about the context of an input sentence

- Distributed representations = features!
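A minimal sketch of obtaining such sentence features with the public skip-thoughts implementation (github.com/ryankiros/skip-thoughts), assuming its pretrained model files have been downloaded as described in that repository's README:

```python
import skipthoughts

# Load the pretrained encoder (model paths are configured in the package).
model = skipthoughts.load_model()
encoder = skipthoughts.Encoder(model)

sentences = [
    "After the scandal, he finally spilled the beans.",
    "She spilled the beans all over the kitchen floor.",
]

# One 4800-dimensional combine-skip vector per input sentence.
vectors = encoder.encode(sentences)
```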


Distributed Representations vs Idioms

- Distributed representations cluster words (word2vec) or sentences (sent2vec) with similar semantics
  – empirical results have demonstrated this

- Idiomatic vs. literal usages
  – idiomatic usages should also lie in a different part of the space than literal expressions (at least when considering the same expression)


“Per-expression” settings

- Following the baseline evaluation of Peng et al. (2014)

- 4 expressions from the VNC-Tokens dataset:
  – blow+whistle, lose+head, make+scene and take+heart

- Balanced training sets

- Imbalanced test sets


“Per-expression” classifiers

- K-Nearest Neighbours (KNN)
  – with 2, 3, 5 and 10 neighbours

- Support Vector Machines (SVMs); see the sketch after this list
  – Linear SVM: linear kernel and grid search for the best parameters
  – Grid SVM: grid search for the best kernel/parameters
  – SGD SVM: linear kernel trained with Stochastic Gradient Descent
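The following scikit-learn sketch mirrors the classifier configurations above; the parameter grids and the X_train/y_train variables (skip-thought vectors with idiomatic/literal labels) are illustrative assumptions, not the exact settings of the experiments.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# KNN with 2, 3, 5 and 10 neighbours
classifiers = {f"KNN-{k}": KNeighborsClassifier(n_neighbors=k)
               for k in (2, 3, 5, 10)}

# Linear SVM: kernel fixed to linear, grid search over C only
classifiers["Linear SVM"] = GridSearchCV(
    SVC(kernel="linear"), {"C": [0.1, 1, 10, 100]})

# Grid SVM: grid search over both the kernel and its parameters
classifiers["Grid SVM"] = GridSearchCV(
    SVC(), {"kernel": ["linear", "rbf"],
            "C": [0.1, 1, 10, 100],
            "gamma": ["scale", 0.01, 0.001]})

# SGD SVM: linear SVM objective (hinge loss) optimised by SGD
classifiers["SGD SVM"] = SGDClassifier(loss="hinge")

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)  # X_train: sentence vectors; y_train: labels
```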


blow+whistle results

Models                        Precision  Recall  F1-Score
Peng et al. (2014)
  FDA-Topics                    0.62      0.60     0.61
  FDA-Topics+A                  0.47      0.44     0.45
  FDA-Text                      0.65      0.43     0.52
  FDA-Text+A                    0.45      0.49     0.47
  SVMs-Topics                   0.07      0.40     0.12
  SVMs-Topics+A                 0.21      0.54     0.30
  SVMs-Text                     0.17      0.90     0.29
  SVMs-Text+A                   0.24      0.87     0.38
Distributed Representations
  KNN-2                         0.61      0.41     0.49
  KNN-3                         0.84      0.32     0.46
  KNN-5                         0.79      0.28     0.41
  KNN-10                        0.83      0.30     0.44
  Linear SVM                    0.77      0.50     0.60
  Grid SVM                      0.80      0.51     0.62
  SGD SVM                       0.70      0.40     0.51


lose+head results

Models                        Precision  Recall  F1-Score
Peng et al. (2014)
  FDA-Topics                    0.76      0.97     0.85
  FDA-Topics+A                  0.74      0.93     0.82
  FDA-Text                      0.72      0.73     0.72
  FDA-Text+A                    0.67      0.88     0.76
  SVMs-Topics                   0.60      0.83     0.70
  SVMs-Topics+A                 0.66      0.77     0.71
  SVMs-Text                     0.30      0.50     0.38
  SVMs-Text+A                   0.66      0.85     0.74
Distributed Representations
  KNN-2                         0.30      0.64     0.41
  KNN-3                         0.58      0.65     0.61
  KNN-5                         0.57      0.65     0.61
  KNN-10                        0.28      0.68     0.40
  Linear SVM                    0.72      0.84     0.77
  Grid SVM                      0.83      0.89     0.85
  SGD SVM                       0.73      0.79     0.76


make+scene results

Models                        Precision  Recall  F1-Score
Peng et al. (2014)
  FDA-Topics                    0.79      0.95     0.86
  FDA-Topics+A                  0.82      0.69     0.75
  FDA-Text                      0.79      0.95     0.86
  FDA-Text+A                    0.80      0.99     0.88
  SVMs-Topics                   0.46      0.57     0.51
  SVMs-Topics+A                 0.42      0.29     0.34
  SVMs-Text                     0.10      0.01     0.02
  SVMs-Text+A                   0.07      0.01     0.02
Distributed Representations
  KNN-2                         0.55      0.89     0.68
  KNN-3                         0.88      0.88     0.88
  KNN-5                         0.87      0.83     0.85
  KNN-10                        0.85      0.83     0.84
  Linear SVM                    0.81      0.91     0.86
  Grid SVM                      0.80      0.91     0.85
  SGD SVM                       0.85      0.91     0.88


take+heart results

Models                        Precision  Recall  F1-Score
Peng et al. (2014)
  FDA-Topics                    0.93      0.99     0.96
  FDA-Topics+A                  0.92      0.98     0.95
  FDA-Text                      0.46      0.40     0.43
  FDA-Text+A                    0.47      0.29     0.36
  SVMs-Topics                   0.90      1.00     0.95
  SVMs-Topics+A                 0.91      1.00     0.95
  SVMs-Text                     0.65      0.21     0.32
  SVMs-Text+A                   0.74      0.13     0.22
Distributed Representations
  KNN-2                         0.46      0.96     0.62
  KNN-3                         0.72      0.94     0.81
  KNN-5                         0.73      0.94     0.82
  KNN-10                        0.78      0.94     0.85
  Linear SVM                    0.73      0.96     0.83
  Grid SVM                      0.72      0.96     0.82
  SGD SVM                       0.61      0.95     0.74


“Per-expression” evaluation

- No single model performed best for all expressions

- The SVMs consistently outperformed the KNNs

- The features of Peng et al. (2014) may capture a different set of dimensions

- A combination with the baseline model may result in a stronger classifier


“General classifier” settings

- Simulation of the expected behaviour on real data

- 27 expressions from the “balanced” part of the VNC-Tokens dataset

- Imbalanced training set

- Imbalanced test set


“General classifier” classifiers

- SVMs only; see the sketch after this list
  – Linear SVM: linear kernel and grid search for the best parameters
  – Grid SVM: grid search for the best kernel/parameters
  – SGD SVM: linear kernel trained with Stochastic Gradient Descent
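A sketch of the “general” setting: a single SVM fitted on pooled vectors from all expressions and then scored per expression. The train_sets/test_sets dictionaries (expression → (vectors, labels)) are hypothetical names for data built from skip-thought encodings of the VNC-Tokens sentences.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_fscore_support

# Pool the training data of every expression into a single training set.
X_train = np.vstack([X for X, _ in train_sets.values()])
y_train = np.concatenate([y for _, y in train_sets.values()])

general_clf = SVC(kernel="linear").fit(X_train, y_train)

# Report precision/recall/F1 separately for each expression
# (labels assumed binary: 1 = idiomatic, 0 = literal).
for expr, (X_test, y_test) in test_sets.items():
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, general_clf.predict(X_test), average="binary")
    print(f"{expr}: Pr={p:.2f} Rec={r:.2f} F1={f1:.2f}")
```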


“General classifier” results

                 Linear SVM           Grid SVM             SGD SVM
Expressions      Pr.   Rec.  F1      Pr.   Rec.  F1      Pr.   Rec.  F1
blow+whistle     0.84  0.67  0.75    0.84  0.68  0.75    0.67  0.59  0.63
lose+head        0.78  0.66  0.72    0.75  0.64  0.69    0.75  0.67  0.71
make+scene       0.92  0.84  0.88    0.92  0.81  0.86    0.78  0.81  0.79
take+heart       0.94  0.79  0.86    0.94  0.80  0.86    0.86  0.80  0.83
Total            0.84  0.80  0.83    0.84  0.80  0.83    0.79  0.79  0.78


“General classifier” evaluation

- Reflects the expected behaviour in the “real world”
  – takes the imbalances of real data into account

- 2 classifiers achieved high performance
  – same overall precision, recall and F1
  – deviations occurred across individual expressions

- Performance is still not consistent over all classifiers and across expressions


PCA Analysis of Distributed Representations on the “General” Classifier
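The slide's plot is not reproduced in this transcript, but a projection of this kind can be sketched as follows; X (sentence vectors) and y (0 = literal, 1 = idiomatic) are assumed given.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the high-dimensional sentence vectors onto the first two
# principal components and colour the points by usage label.
coords = PCA(n_components=2).fit_transform(X)
y = np.asarray(y)

for label, name in ((0, "literal"), (1, "idiomatic")):
    pts = coords[y == label]
    plt.scatter(pts[:, 0], pts[:, 1], label=name, alpha=0.6)

plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.legend()
plt.show()
```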


Conclusions

- Our approach requires fewer resources to achieve roughly the same performance

- SVMs generally perform better than KNNs

- A “general classifier” is feasible

- “Per-expression” classification still achieves better results in some cases


Future Work on Idiom Token Classification

- Apply the approach to languages other than English

- Apply it to other datasets
  – e.g., the IDX Corpus

- What are the main sources of error for the “general classifier”?
  – a better understanding of the representations is needed


Idiom Token Classification on Machine Translation Pipeline

(Salton et al., 2014b)

[figure walkthrough of the substitution method for idiom translation described in Salton et al. (2014b); figures not recoverable from this transcript]


References

Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems 28, pages 3276–3284.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.

Jing Peng, Anna Feldman, and Ekaterina Vylomova. 2014. Classifying idiomatic and literal expressions using topic models and intensity of emotions. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2019–2027, October.

Giancarlo D. Salton, Robert J. Ross, and John D. Kelleher. 2014a. An Empirical Study of the Impact of Idioms on Phrase Based Statistical Machine Translation of English to Brazilian-Portuguese. In Third Workshop on Hybrid Approaches to Translation (HyTra), pages 36–41.

Giancarlo D. Salton, Robert J. Ross, and John D. Kelleher. 2014b. Evaluation of a substitution method for idiom transformation in statistical machine translation. In The 10th Workshop on Multiword Expressions (MWE 2014), pages 38–42.


Thank you!

Giancarlo D. Salton would like to thank CAPES (“Coordenação de Aperfeiçoamento de Pessoal de Nível Superior”) for his Science Without Borders scholarship, proc n. 9050-13-2.
