4
Discovery of MWEs Tomas Krilavičius, Justina Mandravickaitė, Michael Oakes, Carlos Ramisch, Federico Sangati, Veronika Vincze WG3 Report on MWE Processing Valletta, Malta 19,20 March 2015

Discovery of MWEs · Most machine learning methods use lexical resources, i.e., corpora, treebanks, dictionar-ies, lexicons, etc. ! Dependency on certain resources makes supervised

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Discovery of MWEs · Most machine learning methods use lexical resources, i.e., corpora, treebanks, dictionar-ies, lexicons, etc. ! Dependency on certain resources makes supervised

Discovery of MWEs

Tomas Krilavičius, Justina Mandravickaitė, Michael Oakes, Carlos Ramisch, Federico Sangati, Veronika Vincze

WG3 Report on MWE Processing

Valletta, Malta19,20 March 2015

Page 2: Discovery of MWEs · Most machine learning methods use lexical resources, i.e., corpora, treebanks, dictionar-ies, lexicons, etc. ! Dependency on certain resources makes supervised

D����������MWE�WG� R����� �� MWE P���������

T���� K����������, J������ M�������������̇, M������ O����,C����� R������, F������� S������, V������� V�����

A�������This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a ground forthe state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are furtherdetailed in our shared document: http://goo.gl/IE0hLC

P������� C������One-to-many correspondences are ex-

ploited in MWE detection and cross-lingualMWE detection is also enabled:

Techniques based on word alignment,dependency parsing, alignment mismatchesand/or decision trees have been used for MWEdetection. Statistical MT systems also exploitMWE-annotated paralell corpora.

A���������� ��������Lexical measures that estimate the associ-

ation strength between words are one of themain tools employed in unsupervised discoveryof MWEs in corpora. There are di�erent waysof measuring this strength of word association:

• Measures based on raw frequency• Measures based on information theory,

e.g. pointwise mutual information• Measures based on the contingency ta-

bles, e.g. chi-square• Statistical significance• Measures of association between 3 or

more words• Measures which use linguistic information

in addition to word frequencies, e.g. a�n-ity of a word to a syntactic pattern

=) No consensus about best type of measureto use in each case

S��������� M������ L�������Most machine learning methods use lexical

resources, i.e., corpora, treebanks, dictionar-ies, lexicons, etc. ! Dependency on certainresources makes supervised machine learn-ing approaches partly language-dependent Pri-mary lexical resources can be complementedwith web dictionaries and WordNet.

Features typically employed:• n-gram frequencies• lemmas• orthographic variations• association measures• morphosyntactic patternsCommon techniques for complementing su-

pervised machine learning: filtering, pregroup-ing, re-ranking, thresholds, combination, POStags, chunks, chunk sequences, manual anno-tation and evaluation.

T����

Corpus searches and concordancersSketch engine, AntConc, WordSmith

Association measures and patternsUCS, Text::NSP, mwetoolkit, LocalMaxs,

ACCURAT toolkit, Xtract (Dragon), bgMWEToken-based annotation/tagging

jMWE, AMALGr, FIPS-Co, StringNetRecurring tree fragments

FragmentSeeker, DiscoDOP, Varro

S������� P���������Based on the non-decomposability property:the meaning of an MWE cannot be derived fromthe meanings of its component words.Context distribution methods (hot dog 6= dog)

HOT DOG

SANDWICHCAT

DOG

Substitution methods

Expression Substitution MWEBreak the vase Break the cup NOBreak the ice Break the snow YES

E���������

How to evaluate a lexicon of automaticallydiscovered MWEs?

Challenging and open question

D����������MWE�WG� R����� �� MWE P���������

T���� K����������, J������ M�������������̇, M������ O����,C����� R������, F������� S������, V������� V�����

A�������This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a ground forthe state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are furtherdetailed in our shared document: http://goo.gl/IE0hLC

P������� C������One-to-many correspondences are ex-

ploited in MWE detection and cross-lingualMWE detection is also enabled:

make a decision entscheiden

make a decision tomar uma decisão

Techniques based on word alignment,dependency parsing, alignment mismatchesand/or decision trees have been used for MWEdetection. Statistical MT systems also exploitMWE-annotated paralell corpora.

A���������� ��������Lexical measures that estimate the associ-

ation strength between words are one of themain tools employed in unsupervised discoveryof MWEs in corpora. There are di�erent waysof measuring this strength of word association:

• Measures based on raw frequency• Measures based on information theory,

e.g. pointwise mutual information• Measures based on the contingency ta-

bles, e.g. chi-square• Statistical significance• Measures of association between 3 or

more words• Measures which use linguistic information

in addition to word frequencies, e.g. a�n-ity of a word to a syntactic pattern

=) No consensus about best type of measureto use in each case

S��������� M������ L�������Most machine learning methods use lexical

resources, i.e., corpora, treebanks, dictionar-ies, lexicons, etc. ! Dependency on certainresources makes supervised machine learn-ing approaches partly language-dependent Pri-mary lexical resources can be complementedwith web dictionaries and WordNet.

Features typically employed:• n-gram frequencies• lemmas• orthographic variations• association measures• morphosyntactic patternsCommon techniques for complementing su-

pervised machine learning: filtering, pregroup-ing, re-ranking, thresholds, combination, POStags, chunks, chunk sequences, manual anno-tation and evaluation.

T����

Corpus searches and concordancersSketch engine, AntConc, WordSmith

Association measures and patternsUCS, Text::NSP, mwetoolkit, LocalMaxs,

ACCURAT toolkit, Xtract (Dragon), bgMWEToken-based annotation/tagging

jMWE, AMALGr, FIPS-Co, StringNetRecurring tree fragments

FragmentSeeker, DiscoDOP, Varro

S������� P���������Based on the non-decomposability property:the meaning of an MWE cannot be derived fromthe meanings of its component words.Context distribution methods (hot dog 6= dog)

HOT DOG

SANDWICHCAT

DOG

Substitution methods

Expression Substitution MWEBreak the vase Break the cup NOBreak the ice Break the snow YES

E���������

How to evaluate a lexicon of automaticallydiscovered MWEs?

Challenging and open question

D����������MWE�WG� R����� �� MWE P���������

T���� K����������, J������ M�������������̇, M������ O����,C����� R������, F������� S������, V������� V�����

A�������This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a ground forthe state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are furtherdetailed in our shared document: http://goo.gl/IE0hLC

P������� C������One-to-many correspondences are ex-

ploited in MWE detection and cross-lingualMWE detection is also enabled:

Techniques based on word alignment,dependency parsing, alignment mismatchesand/or decision trees have been used for MWEdetection. Statistical MT systems also exploitMWE-annotated paralell corpora.

A���������� ��������Lexical measures that estimate the associ-

ation strength between words are one of themain tools employed in unsupervised discoveryof MWEs in corpora. There are di�erent waysof measuring this strength of word association:

• Measures based on raw frequency• Measures based on information theory,

e.g. pointwise mutual information• Measures based on the contingency ta-

bles, e.g. chi-square• Statistical significance• Measures of association between 3 or

more words• Measures which use linguistic information

in addition to word frequencies, e.g. a�n-ity of a word to a syntactic pattern

=) No consensus about best type of measureto use in each case

S��������� M������ L�������Most machine learning methods use lexical

resources, i.e., corpora, treebanks, dictionar-ies, lexicons, etc. ! Dependency on certainresources makes supervised machine learn-ing approaches partly language-dependent Pri-mary lexical resources can be complementedwith web dictionaries and WordNet.

Features typically employed:• n-gram frequencies• lemmas• orthographic variations• association measures• morphosyntactic patternsCommon techniques for complementing su-

pervised machine learning: filtering, pregroup-ing, re-ranking, thresholds, combination, POStags, chunks, chunk sequences, manual anno-tation and evaluation.

T����

Corpus searches and concordancersSketch engine, AntConc, WordSmith

Association measures and patternsUCS, Text::NSP, mwetoolkit, LocalMaxs,

ACCURAT toolkit, Xtract (Dragon), bgMWEToken-based annotation/tagging

jMWE, AMALGr, FIPS-Co, StringNetRecurring tree fragments

FragmentSeeker, DiscoDOP, Varro

S������� P���������Based on the non-decomposability property:the meaning of an MWE cannot be derived fromthe meanings of its component words.Context distribution methods (hot dog 6= dog)

HOT DOG

SANDWICHCAT

DOG

Substitution methods

Expression Substitution MWEBreak the vase Break the cup NOBreak the ice Break the snow YES

E���������

How to evaluate a lexicon of automaticallydiscovered MWEs?

Challenging and open question

D����������MWE�WG� R����� �� MWE P���������

T���� K����������, J������ M�������������̇, M������ O����,C����� R������, F������� S������, V������� V�����

A�������This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a ground forthe state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are furtherdetailed in our shared document: http://goo.gl/IE0hLC

P������� C������One-to-many correspondences are ex-

ploited in MWE detection and cross-lingualMWE detection is also enabled:

Techniques based on word alignment,dependency parsing, alignment mismatchesand/or decision trees have been used for MWEdetection. Statistical MT systems also exploitMWE-annotated paralell corpora.

A���������� ��������Lexical measures that estimate the associ-

ation strength between words are one of themain tools employed in unsupervised discoveryof MWEs in corpora. There are di�erent waysof measuring this strength of word association:

• Measures based on raw frequency• Measures based on information theory,

e.g. pointwise mutual information• Measures based on the contingency ta-

bles, e.g. chi-square• Statistical significance• Measures of association between 3 or

more words• Measures which use linguistic information

in addition to word frequencies, e.g. a�n-ity of a word to a syntactic pattern

=) No consensus about best type of measureto use in each case

T����

Corpus searches and concordancersSketch engine, AntConc, WordSmith

Association measures and patternsUCS, Text::NSP, mwetoolkit, LocalMaxs,

ACCURAT toolkit, Xtract (Dragon), bgMWEToken-based annotation/tagging

jMWE, AMALGr, FIPS-Co, StringNetRecurring tree fragments

FragmentSeeker, DiscoDOP, Varro

S������� P���������Based on the non-decomposability property:the meaning of an MWE cannot be derived fromthe meanings of its component words.Context distribution methods (hot dog 6= dog)

HOT DOG

SANDWICHCAT

DOG

Substitution methods

Expression Substitution MWEBreak the vase Break the cup NOBreak the ice Break the snow YES

E���������

How to evaluate a lexicon of automaticallydiscovered MWEs?

Challenging and open question

Discovery of MWEsMethodologies

Page 3: Discovery of MWEs · Most machine learning methods use lexical resources, i.e., corpora, treebanks, dictionar-ies, lexicons, etc. ! Dependency on certain resources makes supervised

D����������MWE�WG� R����� �� MWE P���������

T���� K����������, J������ M�������������̇, M������ O����,C����� R������, F������� S������, V������� V�����

A�������This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a ground forthe state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are furtherdetailed in our shared document: http://goo.gl/IE0hLC

P������� C������One-to-many correspondences are ex-

ploited in MWE detection and cross-lingualMWE detection is also enabled:

Techniques based on word alignment,dependency parsing, alignment mismatchesand/or decision trees have been used for MWEdetection. Statistical MT systems also exploitMWE-annotated paralell corpora.

A���������� ��������Lexical measures that estimate the associ-

ation strength between words are one of themain tools employed in unsupervised discoveryof MWEs in corpora. There are di�erent waysof measuring this strength of word association:

• Measures based on raw frequency• Measures based on information theory,

e.g. pointwise mutual information• Measures based on the contingency ta-

bles, e.g. chi-square• Statistical significance• Measures of association between 3 or

more words• Measures which use linguistic information

in addition to word frequencies, e.g. a�n-ity of a word to a syntactic pattern

=) No consensus about best type of measureto use in each case

T����

ababs

a/N b/Vc/J s/Ds/A d/P

a b ac s ds d a d

raw corpora

analysed corpora(POS tagged, parsed...)

a b ac s ds d a d

a bd a d

annotated corpora

[ranked]MWE lists

Corpus searches and concordancersSketch engine, AntConc, WordSmith

Association measures and patternsUCS, Text::NSP, mwetoolkit, LocalMaxs,

ACCURAT toolkit, Xtract (Dragon), bgMWEToken-based annotation/tagging

jMWE, AMALGr, FIPS-Co, StringNetRecurring tree fragments

FragmentSeeker, DiscoDOP, Varro

S������� P���������Based on the non-decomposability property:the meaning of an MWE cannot be derived fromthe meanings of its component words.Context distribution methods (hot dog 6= dog)

HOT DOG

SANDWICHCAT

DOG

Substitution methods

Expression Substitution MWEBreak the vase Break the cup NOBreak the ice Break the snow YES

E���������

How to evaluate a lexicon of automaticallydiscovered MWEs?

Challenging and open question

D����������MWE�WG� R����� �� MWE P���������

T���� K����������, J������ M�������������̇, M������ O����,C����� R������, F������� S������, V������� V�����

A�������This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a ground forthe state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are furtherdetailed in our shared document: http://goo.gl/IE0hLC

P������� C������One-to-many correspondences are ex-

ploited in MWE detection and cross-lingualMWE detection is also enabled:

Techniques based on word alignment,dependency parsing, alignment mismatchesand/or decision trees have been used for MWEdetection. Statistical MT systems also exploitMWE-annotated paralell corpora.

A���������� ��������Lexical measures that estimate the associ-

ation strength between words are one of themain tools employed in unsupervised discoveryof MWEs in corpora. There are di�erent waysof measuring this strength of word association:

• Measures based on raw frequency• Measures based on information theory,

e.g. pointwise mutual information• Measures based on the contingency ta-

bles, e.g. chi-square• Statistical significance• Measures of association between 3 or

more words• Measures which use linguistic information

in addition to word frequencies, e.g. a�n-ity of a word to a syntactic pattern

=) No consensus about best type of measureto use in each case

T����

Corpus searches and concordancersSketch engine, AntConc, WordSmith

Association measures and patternsUCS, Text::NSP, mwetoolkit, LocalMaxs,

ACCURAT toolkit, Xtract (Dragon), bgMWEToken-based annotation/tagging

jMWE, AMALGr, FIPS-Co, StringNetRecurring tree fragments

FragmentSeeker, DiscoDOP, Varro

S������� P���������Based on the non-decomposability property:the meaning of an MWE cannot be derived fromthe meanings of its component words.Context distribution methods (hot dog 6= dog)

HOT DOG

SANDWICHCAT

DOG

Substitution methods

Expression Substitution MWEBreak the vase Break the cup NOBreak the ice Break the snow YES

E���������

How to evaluate a lexicon of automaticallydiscovered MWEs?

3. Use/integrate2. Measure

1. Annotate/judge

MWEw1 w2 w3

w1 w4w2 w7 w1

MWE Score Annot.

Threshold

Precision: X%Recall (?): Y%

xi

Parser+MWEsParser

Challenging and open question

Discovery of MWEsTools & Evaluation

Page 4: Discovery of MWEs · Most machine learning methods use lexical resources, i.e., corpora, treebanks, dictionar-ies, lexicons, etc. ! Dependency on certain resources makes supervised

Thank you!

Discovery of MWEs

Tomas Krilavičius, Justina Mandravickaitė, Michael Oakes, Carlos Ramisch, Federico Sangati, Veronika Vincze

WG3 Report on MWE Processing