
ParaMor Minimally Supervised Induction of Paradigm Structure and Morphological Analysis


Christian Monson, Jaime Carbonell, Alon Lavie, Lori Levin

Monolingual Text -> Unsupervised Morphology Induction -> Morphologically Analyzed Text

Paradigms Organize Inflectional Morphology

English

Paradigm Cells    Inflection Class
                  ‘eat’     ‘silent-e’
Unmarked          eat       dance, erase, …
Present, 3rd      eats      dances, erases, …
Past Tense        ate       danced, erased, …
Progressive       eating    dancing, erasing, …
Passive           eaten     danced, erased, …

Paradigm Discovery in 3 Steps

2. Cluster Candidate Paradigms

3. Filter Unlikely Candidates

Spanish

Paradigm Cells      Inflection Class
                    ar      er      ir
1st, Sg, Present    o       o       o
2nd, Sg, Present    as      es      es
3rd, Sg, Present    a       e       e
1st, Pl, Present    amos    emos    imos
...                 …       …       …

Mapudungun (Non-Indoeuropean, Central Chile)

Paradigm           Affixes
Hab                ke, Ø
Mode               pe, Ø
Report             (ü)rke, Ø
Pol / Mood         la, ki, nu, Ø
Tense              a, fu, afu, Ø
Obj Agr            fi, Ø
Subj Agr / Mood    (ü)n, li, chi, yu, …
Loc                pa, pu, Ø
Asp                tu, ka, Ø

Results

Inflectional & Derivational

Language   System      P      R      F1     σ(F1)
English    ParaMor     48.9   53.6   51.1   0.8
English    Morfessor   73.6   34.0   46.5   1.1
German     ParaMor     60.0   33.5   43.0   0.7
German     Morfessor   66.9   37.1   47.7   0.7

Inflectional Only

Language   System      P      R      F1     σ(F1)
English    ParaMor     33.0   81.4   47.0   0.9
English    Morfessor   53.3   47.0   49.9   1.3
German     ParaMor     42.8   68.6   52.7   0.8
German     Morfessor   38.7   44.2   41.2   0.8
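
The F1 values above are consistent with the usual harmonic mean of precision (P) and recall (R). The short sketch below assumes that definition and recomputes the English Inflectional & Derivational F1 scores from the reported P and R. The function name `f1` is illustrative only.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# P and R for English, Inflectional & Derivational, copied from the results.
paramor_en = f1(48.9, 53.6)
morfessor_en = f1(73.6, 34.0)

print(round(paramor_en, 1))    # 51.1
print(round(morfessor_en, 1))  # 46.5
```

ParaMor trades precision for recall relative to Morfessor in this condition, and the harmonic mean rewards the more balanced pair.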

Segmentation Evaluation Methodology

Morpho Challenge 2007

Competition for unsupervised morphology induction algorithms

English

3rd Place Overall

Bested Morfessor (Creutz, 2006), a state-of-the-art unsupervised morphology induction algorithm

German

1st Place with Combined ParaMor-Morfessor System

Small Candidates contain few affixes and cover few types.

Ø.ipo covers 8 words

Ø.e.iu covers 12 words

Incorrect Morpheme Boundary Candidates segment too far to the left.

iza.izado.izan.izar.izaron.izarán.izó

der.derá.dido.diendo.dieron.dió.día

16: a.aba.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.aría.ó

15: a.aba.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.ó

15: a.aba.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.aría.ó

15: a.aba.aban.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.ó

17: a.aba.aban.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.aría.ó
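
The counts in the cluster listing above are consistent with each merge taking the union of the affix sets being joined. The sketch below assumes that reading and checks it on three of the 15-affix candidates. It does not model how ParaMor chooses which candidates to merge, and the names `parse`, `with_ara`, `with_aria`, and `with_aban` are illustrative.

```python
def parse(candidate: str) -> frozenset[str]:
    """Split a dot-separated candidate such as 'a.aba.ada' into an affix set."""
    return frozenset(candidate.split("."))


# Three 15-affix candidates copied from the listing above.
with_ara = parse("a.aba.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.ó")
with_aria = parse("a.aba.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.aría.ó")
with_aban = parse("a.aba.aban.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.ó")

# Assumption: a merged cluster holds the union of its members' affixes.
first_merge = with_ara | with_aria
second_merge = first_merge | with_aban

print(len(first_merge))   # 16
print(len(second_merge))  # 17
```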

llega

Error analysis identified 2 major categories of incorrect candidates

1. Match the word to be segmented against the clustered affixes

2. Replace any matched affix with a new affix from the same cluster

3. Segment the original word if the corpus contains the hypothesized word form

lleg aba

lleg aban

lleg ada

…

lleg +a
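
The three segmentation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not ParaMor's implementation: the cluster is a plain set of suffix strings, the corpus is a set of word forms, the null affix Ø is skipped, and the function name `segment` and the toy data are hypothetical.

```python
def segment(word: str, cluster: set[str], corpus: set[str]) -> list[tuple[str, str]]:
    """Return (stem, suffix) splits of `word` that the corpus supports."""
    analyses = []
    for suffix in sorted(cluster):
        # Step 1: match the word against an affix from the cluster.
        # The empty (null) affix and whole-word matches are skipped here.
        if not suffix or len(suffix) >= len(word) or not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        # Step 2: replace the matched affix with another affix from the cluster.
        for other in sorted(cluster):
            if other == suffix:
                continue
            # Step 3: keep the split if the hypothesized form is attested.
            if stem + other in corpus:
                analyses.append((stem, suffix))
                break
    return analyses


# Toy data built from the 'llega' example above.
cluster = {"a", "aba", "aban", "ada"}
corpus = {"llega", "llegaba", "llegaban", "llegada"}

print(segment("llega", cluster, corpus))  # [('lleg', 'a')]
```

Requiring an attested sibling form is what keeps a word that merely ends in a cluster affix from being split.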

1. Sample pairs of words that share morphemes.

Precision: Sample pairs sharing a morpheme in the automatic analyses

Recall: Sample pairs sharing a morpheme in an answer key of morphologically analyzed words

2. Examine corresponding analyses

Precision: Count sampled pairs that share a morpheme in the answer key

Recall: Count sampled pairs that share a morpheme in the automatic analyses
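
A small sketch of this pair-based scoring is given below. It is a simplification under stated assumptions: every word pair is enumerated instead of sampled, each analysis is a set of morpheme labels, and the toy analyses, the label names, and the helpers `sharing_pairs` and `pair_score` are hypothetical. Morpheme labels are only compared within one set of analyses, so the automatic labels and the answer-key labels need not match each other.

```python
from itertools import combinations

Analyses = dict[str, set[str]]  # word -> set of morpheme labels


def sharing_pairs(analyses: Analyses) -> set[tuple[str, str]]:
    """All word pairs whose analyses have at least one morpheme in common."""
    return {
        (a, b)
        for a, b in combinations(sorted(analyses), 2)
        if analyses[a] & analyses[b]
    }


def pair_score(source: Analyses, reference: Analyses) -> float:
    """Fraction of morpheme-sharing pairs in `source` that also share in `reference`."""
    pairs = [p for p in sharing_pairs(source) if p[0] in reference and p[1] in reference]
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if reference[a] & reference[b])
    return hits / len(pairs)


# Hypothetical automatic analyses and answer key for four Spanish word forms.
automatic = {
    "llega": {"lleg", "+a"},
    "llegaba": {"lleg", "+aba"},
    "canta": {"cant", "+a"},
    "cantaba": {"cantaba"},  # left unsegmented
}
answer_key = {
    "llega": {"llegar", "3SG.PRES"},
    "llegaba": {"llegar", "3SG.IMPF"},
    "canta": {"cantar", "3SG.PRES"},
    "cantaba": {"cantar", "3SG.IMPF"},
}

precision = pair_score(automatic, answer_key)  # pairs drawn from the automatic analyses
recall = pair_score(answer_key, automatic)     # pairs drawn from the answer key
print(precision, recall)  # 1.0 0.5
```

The unsegmented form lowers recall only: it creates no pairs in the automatic analyses, but the answer key expects it to share morphemes with two other words.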

Cross-linguistically, languages inflect using paradigms—sets of mutually exclusive cells. Exactly one cell from each paradigm can be filled (by an affix) in a surface word form.

1. Search – Greedy bottom-up search through an empirical network of candidate partial paradigms. Here, red candidate paradigms are active in search

2. Cluster – Hierarchical agglomerative clustering adapted to the peculiarities of partial paradigms

3. Filter – Improve precision by removing unclustered and unlikely candidates

Spanish data guided algorithm development and parameter adjustment

1. Recall Centric Search

e.er.erá.ido.ieron.ió 28: deb, escog, ofrec, reconoc, vend, ...

e.ido.ieron.ir.irá.ió 28: asist, dirig, exig, ocurr, sufr, ...

e.erá.ido.ieron.ió 28: deb, escog, ...

e.er.ido.ieron.ió 46: deb, parec, recog, ...

e.ido.ieron.irá.ió 28: asist, dirig, ...

e.ido.ieron.ir.ió 39: asist, bat, sal, ...

e.er.erá.ieron.ió 32: deb, padec, romp, ...

e.ido.ieron.ió 86: asist, deb, hund, ...

e.erá.ieron.ió 32: deb, padec, ...

er.ido.ieron.ió 58: ascend, ejerc, recog, ...

ido.ieron.ir.ió 44: interrump, sal, ...

azar.e.ido.ieron.ir.ió 1: sal
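
Each entry in the network above pairs a set of suffixes with the number of stems that occur with all of them, followed by sample stems. The sketch below shows one way such support could be computed from a word list. It assumes a plain set of surface forms as the vocabulary, and the helper `supporting_stems` and the toy vocabulary are hypothetical, not taken from ParaMor.

```python
def supporting_stems(suffixes: set[str], vocabulary: set[str]) -> set[str]:
    """Stems that occur in the vocabulary with every suffix in `suffixes`."""
    if not suffixes:
        return set()
    first, *rest = sorted(suffixes)
    # Candidate stems come from words ending in the first suffix ...
    stems = {
        word[: len(word) - len(first)]
        for word in vocabulary
        if word.endswith(first) and len(word) > len(first)
    }
    # ... and survive only if every other suffix is also attested on them.
    return {stem for stem in stems if all(stem + s in vocabulary for s in rest)}


# Toy vocabulary: 'deb' takes all five suffixes, 'asist' lacks 'er', 'vend' lacks two.
vocabulary = {
    "debe", "deber", "debido", "debieron", "debió",
    "asiste", "asistido", "asistieron", "asistió",
    "vende", "vendido",
}

small = supporting_stems({"e", "ido", "ieron", "ió"}, vocabulary)
large = supporting_stems({"e", "er", "ido", "ieron", "ió"}, vocabulary)
print(sorted(small))  # ['asist', 'deb']
print(sorted(large))  # ['deb']
```

Adding a suffix can only shrink the supporting stem set, which matches the listing: e.ido.ieron.ió has 86 stems, while e.er.ido.ieron.ió has 46.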

The Next Steps

Extend ParaMor to hypothesize more than one morpheme boundary per analysis

Expand beyond suffixation to other morphological phenomena, prefixes, etc.

Merge inflection classes of the same paradigm

Identify morphophonemic changes

A Closer Look at ParaMor vs. Morfessor