Discovery of MWEs · Most machine learning methods use lexical resources, i.e., corpora, treebanks,...

Discovery of MWEs

Tomas Krilavičius, Justina Mandravickaitė, Michael Oakes, Carlos Ramisch, Federico Sangati, Veronika Vincze

WG3 Report on MWE Processing

Valletta, Malta19,20 March 2015

D��MWE�WG� R�� MWE P��

T�� K��, J�� M��̇, M�� O��,C�� R��, F�� S��, V�� V��

A��This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a ground forthe state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are furtherdetailed in our shared document: http://goo.gl/IE0hLC

P�� C��One-to-many correspondences are ex-

ploited in MWE detection and cross-lingualMWE detection is also enabled:

Techniques based on word alignment,dependency parsing, alignment mismatchesand/or decision trees have been used for MWEdetection. Statistical MT systems also exploitMWE-annotated paralell corpora.

A�� Lexical measures that estimate the associ-

ation strength between words are one of themain tools employed in unsupervised discoveryof MWEs in corpora. There are di�erent waysof measuring this strength of word association:

• Measures based on raw frequency• Measures based on information theory,

e.g. pointwise mutual information• Measures based on the contingency ta-

bles, e.g. chi-square• Statistical significance• Measures of association between 3 or

more words• Measures which use linguistic information

in addition to word frequencies, e.g. a�n-ity of a word to a syntactic pattern

=) No consensus about best type of measureto use in each case

S�� M�� L��Most machine learning methods use lexical

resources, i.e., corpora, treebanks, dictionar-ies, lexicons, etc. ! Dependency on certainresources makes supervised machine learn-ing approaches partly language-dependent Pri-mary lexical resources can be complementedwith web dictionaries and WordNet.

Features typically employed:• n-gram frequencies• lemmas• orthographic variations• association measures• morphosyntactic patternsCommon techniques for complementing su-

pervised machine learning: filtering, pregroup-ing, re-ranking, thresholds, combination, POStags, chunks, chunk sequences, manual anno-tation and evaluation.

T��

Corpus searches and concordancersSketch engine, AntConc, WordSmith

Association measures and patternsUCS, Text::NSP, mwetoolkit, LocalMaxs,

ACCURAT toolkit, Xtract (Dragon), bgMWEToken-based annotation/tagging

jMWE, AMALGr, FIPS-Co, StringNetRecurring tree fragments

FragmentSeeker, DiscoDOP, Varro

S�� P��Based on the non-decomposability property:the meaning of an MWE cannot be derived fromthe meanings of its component words.Context distribution methods (hot dog 6= dog)

HOT DOG

SANDWICHCAT

Substitution methods

Expression Substitution MWEBreak the vase Break the cup NOBreak the ice Break the snow YES

E��

How to evaluate a lexicon of automaticallydiscovered MWEs?

Challenging and open question

T�� K��, J�� M��̇, M�� O��,C�� R��, F�� S��, V�� V��

make a decision entscheiden

make a decision tomar uma decisão

T��

HOT DOG

SANDWICHCAT

E��

T�� K��, J�� M��̇, M�� O��,C�� R��, F�� S��, V�� V��

T��

HOT DOG

SANDWICHCAT

E��

T�� K��, J�� M��̇, M�� O��,C�� R��, F�� S��, V�� V��

T��

HOT DOG

SANDWICHCAT

E��

Discovery of MWEsMethodologies

T�� K��, J�� M��̇, M�� O��,C�� R��, F�� S��, V�� V��

T��

a/N b/Vc/J s/Ds/A d/P

a b ac s ds d a d

raw corpora

analysed corpora(POS tagged, parsed...)

a b ac s ds d a d

a bd a d

annotated corpora

[ranked]MWE lists

HOT DOG

SANDWICHCAT

E��

T�� K��, J�� M��̇, M�� O��,C�� R��, F�� S��, V�� V��

T��

HOT DOG

SANDWICHCAT

E��

3. Use/integrate2. Measure

1. Annotate/judge

MWEw1 w2 w3

w1 w4w2 w7 w1

MWE Score Annot.

Threshold

Precision: X%Recall (?): Y%

Parser+MWEsParser

Discovery of MWEsTools & Evaluation

Thank you!

Discovery of MWEs

Tomas Krilavičius, Justina Mandravickaitė, Michael Oakes, Carlos Ramisch, Federico Sangati, Veronika Vincze

WG3 Report on MWE Processing

Discovery of MWEs · Most machine learning methods use lexical resources, i.e., corpora, treebanks,...

Documents

Treebanks, Linguistic Theories and Applications

Extracting Headless MWEs from Dependency Parse Trees ... · coordinating conjunctions, and “ﬂat”, or headless MWEs. While the proper treatment of headless ... pre-trained contextualized

Corpus Linguistics (L615) - Computational Linguisticscl.indiana.edu/~md7/13/615/slides/10-treebanks/10-treebanks.pdf · Corpus Linguistics Syntactic Annotation and Treebanks ...

Computingwith AffectiveLexicons - Stanford Universityjurafsky/slp3/slides/21_SentLex.pdf · Computingwith AffectiveLexicons Affective,)Sentimental,) and)Connotative) Meaning)in)the)Lexicon

A Dependency Parser for Tweets · 2017-10-03 · /30 Multiword Expressions (MWEs) Annotator’s freedom to group words as explicit MWEs: proper names: Justin Bieber, World Series

Ra 9504 MWEs Tax Exemption

Chapter 26 Comparing Lexicons Cross-Linguistically Asifa

AnnCorra : TreeBanks for Indian Languages Guidelines for …docshare01.docshare.tips/files/20536/205364421.pdf · 2018. 11. 29. · AnnCorra : TreeBanks for Indian Languages Guidelines

Law and Early Modern English Lexicons · 2013-07-01 · Law and Early Modern English Lexicons Ian Lancashire University of Toronto 1. The First Printed English Legal Lexicons John

Parallel TreeBanks: Observations for Implication of … · 86 OLEG KAPANADZE A significant part of modern treebanking literature is devoted to creation of large Treebanks for the

Diachronous Conceptual Lexicons as Linked Open Data

Guidelines for the Syntactic Annotation of Latin Treebanks ...Bamman+Crane+Raynaud... · Guidelines for the Syntactic Annotation of Latin Treebanks (v. 1.3) David Bamman1, Marco Passarotti2,

Feasibility Study on the Merger Between MECS and MWES

Frequent Structure Discovery in Treebanks - · PDF fileMany languages also have idioms that translate this phrase, ... Frequent Structure Discovery in Treebanks 101 identifyingfrequentlyreoccurringconnectedsubtreesintreebanks,butitisequally

Computing with Affective Lexicons

Sentiment analytics: Lexicons construction and analysis

Introduction Motivation Linguistic Levels Types of MWEs Approaches to identify MWEs Limitations Conclusion References

LARGE LEXICONS FOR NATURAL LANGUAGE PROCESSING: UTILISING ... · LARGE LEXICONS FOR NATURAL LANGUAGE PROCESSING: UTILISING ... the Longman Dictionary of Contemporary English to

Emergent Adaptive Lexicons

Building Lexicons