ParaMor: Minimally Supervised Induction of Paradigm Structure and Morphological Analysis
Christian Monson, Jaime Carbonell, Alon Lavie, Lori Levin
Monolingual Text → Unsupervised Morphology Induction → Morphologically Analyzed Text
Paradigms Organize Inflectional Morphology

English
Paradigm Cells    Inflection Class
                  ‘eat’      ‘silent-e’
Unmarked          eat        dance, erase, …
Present, 3rd      eats       dances, erases, …
Past Tense        ate        danced, erased, …
Progressive       eating     dancing, erasing, …
Passive           eaten      danced, erased, …
Paradigm Discovery in 3 Steps
2. Cluster Candidate Paradigms
3. Filter Unlikely Candidates
Spanish
Paradigm Cells      Inflection Class
                    ar      er      ir
1st, Sg, Present    o       o       o
2nd, Sg, Present    as      es      es
3rd, Sg, Present    a       e       e
1st, Pl, Present    amos    emos    imos
...                 …       …       …
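Mapudungun (Non-Indo-European, Central Chile)
[Figure: suffix paradigms of the Mapudungun verb, one column per paradigm. Column labels: Loc, Asp, Hab, Mode, Report, Pol / Mood, Tense, Obj Agr, Subj Agr / Mood. Suffixes shown: pa, pu, tu, ka, ke, pe, (ü)rke, la, ki, nu, a, fu, afu, fi, (ü)n, li, chi, yu, …; Ø marks an unfilled cell.]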
Results
Inflectional & Derivational
             English                      German
             P      R      F1     σ(F1)   P      R      F1     σ(F1)
ParaMor      48.9   53.6   51.1   0.8     60.0   33.5   43.0   0.7
Morfessor    73.6   34.0   46.5   1.1     66.9   37.1   47.7   0.7

Inflectional Only
             English                      German
             P      R      F1     σ(F1)   P      R      F1     σ(F1)
ParaMor      33.0   81.4   47.0   0.9     42.8   68.6   52.7   0.8
Morfessor    53.3   47.0   49.9   1.3     38.7   44.2   41.2   0.8
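As a quick check on the results table, F1 is the harmonic mean of precision (P) and recall (R). The sketch below recomputes F1 from the reported P and R values; the function name `f1` and the row labels are ours, not the poster's. Recomputed values can differ from the reported ones in the last digit, presumably because the reported P and R are already rounded.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (system, setting, language) -> (P, R, reported F1), copied from the table above
rows = {
    ("ParaMor", "infl+deriv", "English"): (48.9, 53.6, 51.1),
    ("ParaMor", "infl+deriv", "German"): (60.0, 33.5, 43.0),
    ("ParaMor", "infl only", "English"): (33.0, 81.4, 47.0),
    ("ParaMor", "infl only", "German"): (42.8, 68.6, 52.7),
    ("Morfessor", "infl+deriv", "English"): (73.6, 34.0, 46.5),
    ("Morfessor", "infl+deriv", "German"): (66.9, 37.1, 47.7),
    ("Morfessor", "infl only", "English"): (53.3, 47.0, 49.9),
    ("Morfessor", "infl only", "German"): (38.7, 44.2, 41.2),
}

for key, (p, r, reported) in rows.items():
    # Print the recomputed F1 next to the reported value for comparison.
    print(key, f"recomputed={f1(p, r):.2f}", f"reported={reported}")
```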
Segmentation Evaluation Methodology
Morpho Challenge 2007
Competition for unsupervised morphology induction algorithms
English
3rd Place Overall
Bested Morfessor (Creutz, 2006), a state-of-the-art unsupervised morphology induction algorithm
German
1st Place with Combined ParaMor-Morfessor System
Small Candidates contain few affixes and cover few types
Incorrect Morpheme Boundary Candidates segment too far to the left.
Ø.ipo covers 8 words
Ø.e.iu covers 12 words
iza.izado.izan.izar.izaron.izarán.izó
der.derá.dido.diendo.dieron.dió.día
16: a.aba.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.aría.ó
15: a.aba.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.ó
15: a.aba.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.aría.ó
15: a.aba.aban.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.ó
17: a.aba.aban.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.aría.ó
llega
Error analysis identified 2 major categories of incorrect candidates
1. Match word to segment against clustered affixes
2. Replace any matched affix with new affix from cluster
3. Segment the original word, if the corpus contains the hypothesized word form
lleg aba, lleg aban, lleg ada, …
lleg +a
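The three segmentation steps above can be sketched as follows. This is a minimal illustration, not ParaMor's actual code: the function name `segment`, the list-of-suffixes representation of a cluster, and the toy vocabulary are all assumptions made for the example.

```python
def segment(word, cluster_suffixes, corpus_vocab):
    """Return (stem, suffix) splits of `word` that the corpus supports."""
    analyses = []
    for suffix in cluster_suffixes:
        # Step 1: the word must end in an affix from the cluster.
        if not suffix or not word.endswith(suffix):
            continue
        stem = word[: len(word) - len(suffix)]
        if not stem:
            continue
        # Step 2: replace the matched affix with each other affix in the cluster.
        for other in cluster_suffixes:
            if other == suffix:
                continue
            # Step 3: keep the split if the hypothesized word form is attested.
            if stem + other in corpus_vocab:
                analyses.append((stem, suffix))
                break
    return analyses

# Hypothetical toy data: a few suffixes from the Spanish "ar" cluster
# and a tiny vocabulary of attested word forms.
cluster = ["a", "aba", "aban", "ada", "ado", "ando", "ar", "ó"]
vocab = {"llega", "llegaba", "llegaban", "llegada"}

print(segment("llega", cluster, vocab))
print(segment("casa", cluster, vocab))
```

With this toy vocabulary, `llega` is split as `lleg +a` because `llegaba` is attested, while `casa` is left unsegmented because none of its hypothesized forms occur.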
1. Sample pairs of words that share morphemes.
   Precision: Sample pairs sharing a morpheme in the automatic analyses
   Recall: Sample pairs from an answer key of morphologically analyzed words
2. Examine corresponding analyses
   Precision: Count sampled pairs that share a morpheme in the answer key
   Recall: Count sampled pairs that share a morpheme in the automatic analyses
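The pair-based evaluation above can be sketched in a few lines. This is a simplified illustration under our own assumptions (analyses stored as a word-to-morpheme-set dictionary, a fixed random seed, hypothetical function names); the official Morpho Challenge scoring has additional details that are not reproduced here.

```python
import random
from itertools import combinations

def sharing_pairs(analyses):
    """All unordered word pairs whose analyses share at least one morpheme."""
    return [
        (w1, w2)
        for w1, w2 in combinations(sorted(analyses), 2)
        if analyses[w1] & analyses[w2]
    ]

def pair_score(source, reference, sample_size=1000, seed=0):
    """Fraction of sampled pairs sharing a morpheme in `source`
    that also share a morpheme in `reference`."""
    pairs = [
        (a, b) for a, b in sharing_pairs(source)
        if a in reference and b in reference
    ]
    if not pairs:
        return 0.0
    rng = random.Random(seed)
    sample = rng.sample(pairs, min(sample_size, len(pairs)))
    hits = sum(1 for a, b in sample if reference[a] & reference[b])
    return hits / len(sample)

# Hypothetical toy analyses; "eats" is left unsegmented by the automatic system.
automatic = {
    "dances": {"dance", "s"},
    "erases": {"erase", "s"},
    "danced": {"dance", "d"},
    "eats": {"eats"},
}
answer_key = {
    "dances": {"dance", "+3SG"},
    "erases": {"erase", "+3SG"},
    "danced": {"dance", "+PAST"},
    "eats": {"eat", "+3SG"},
}

precision = pair_score(automatic, answer_key)  # sample from automatic analyses
recall = pair_score(answer_key, automatic)     # sample from the answer key
print(precision, recall)
```

Because only the sharing of a morpheme is compared, the morpheme labels in the automatic analyses do not need to match the labels in the answer key.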
Cross-linguistically, languages inflect using paradigms—sets of mutually exclusive cells. Exactly one cell from each paradigm can be filled (by an affix) in a surface word form.
1. Search – Greedy bottom-up search through an empirical network of candidate partial paradigms. Here, red candidate paradigms are active in search
2. Cluster – Hierarchical agglomerative clustering adapted to the peculiarities of partial paradigms
3. Filter – Improve precision by removing unclustered and unlikely candidates
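To make the clustering step concrete, here is a sketch of plain hierarchical agglomerative clustering over candidate suffix sets. It is a generic version, not ParaMor's adapted algorithm: the Jaccard similarity, the 0.8 threshold, merging by set union, and the names `jaccard` and `agglomerate` are all assumptions chosen for illustration. The candidates are Spanish suffix sets that appear on this poster.

```python
def jaccard(a, b):
    """Overlap of two suffix sets: |a ∩ b| / |a ∪ b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def agglomerate(candidates, threshold=0.8):
    """Repeatedly merge the two most similar clusters until
    no pair reaches the similarity threshold."""
    clusters = [set(c) for c in candidates]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = jaccard(clusters[i], clusters[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        if sim < threshold:
            break
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

candidates = [
    "a.aba.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.ó".split("."),
    "a.aba.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.aría.ó".split("."),
    "a.aba.aban.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.ó".split("."),
    "iza.izado.izan.izar.izaron.izarán.izó".split("."),
]

for cluster in agglomerate(candidates):
    print(len(cluster), ".".join(sorted(cluster)))
```

Under these assumptions the three 15-suffix candidates merge into a single 17-suffix cluster, while the unrelated `iza…` candidate stays on its own.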
Spanish data guided algorithm development and parameter adjustment
1. Recall-Centric Search
e.er.erá.ido.ieron.ió  28: deb, escog, ofrec, reconoc, vend, ...
e.ido.ieron.ir.irá.ió  28: asist, dirig, exig, ocurr, sufr, ...
e.erá.ido.ieron.ió  28: deb, escog, ...
e.er.ido.ieron.ió  46: deb, parec, recog, ...
e.ido.ieron.irá.ió  28: asist, dirig, ...
e.ido.ieron.ir.ió  39: asist, bat, sal, ...
e.er.erá.ieron.ió  32: deb, padec, romp, ...
e.ido.ieron.ió  86: asist, deb, hund, ...
e.erá.ieron.ió  32: deb, padec, ...
er.ido.ieron.ió  58: ascend, ejerc, recog, ...
ido.ieron.ir.ió  44: interrump, sal, ...
azar.e.ido.ieron.ir.ió  1: sal
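Each entry in the network above pairs a set of suffixes with the number of stems that occur with every one of those suffixes, followed by example stems. The sketch below shows how such a stem set could be computed from a vocabulary; the function name `stems_for` and the toy vocabulary are assumptions for illustration, not ParaMor's implementation.

```python
def stems_for(suffixes, vocab):
    """Stems that form an attested word with every suffix in `suffixes`."""
    stems = None
    for suffix in suffixes:
        # Candidate stems for this suffix: strip it from each word that ends in it.
        candidates = {
            w[: len(w) - len(suffix)]
            for w in vocab
            if w.endswith(suffix) and len(w) > len(suffix)
        }
        # A stem must be compatible with all suffixes, so intersect.
        stems = candidates if stems is None else stems & candidates
    return stems or set()

# Hypothetical toy vocabulary with one "er" verb and one "ir" verb.
vocab = {
    "debe", "deber", "debido", "debieron", "debió",
    "asiste", "asistir", "asistido", "asistieron", "asistió",
}

print(sorted(stems_for(["e", "ido", "ieron", "ió"], vocab)))
print(sorted(stems_for(["e", "er", "ido", "ieron", "ió"], vocab)))
print(sorted(stems_for(["e", "ido", "ieron", "ir", "ió"], vocab)))
```

Adding a suffix to a candidate can only keep or shrink its stem set, which matches the counts in the network: e.ido.ieron.ió covers 86 stems, while its extensions with er or ir cover fewer.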
The Next Steps
Extend ParaMor to hypothesize more than one morpheme boundary per analysis
Expand beyond suffixation to other morphological phenomena, prefixes, etc.
Merge inflection classes of the same paradigm
Identify morphophonemic changes
A Closer Look at ParaMor vs. Morfessor