Linguistic Structure and Bilingual Informants Help Induce Machine Translation of Lesser-Resourced Languages
Christian Monson, Ariadna Font Llitjós, Vamshi Ambati, Lori Levin, Alon Lavie, Alison Alvarez, Roberto Aranovich, Jaime Carbonell,
Robert Frederking, Erik Peterson, Katharina Probst
Results
Next (Thursday), block out:
1) Rule learning
2) Rule refinement
Then (Friday), fill in and update details
Finally (next week), make it look perfect
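[Figure: aligned parse trees for the English sentence "John ate an apple" (S → NP VP; NP → N; VP → V NP; NP → Det N) and its translation "John ne ek seb khaya", decomposed into aligned sub-trees: the VP pair "ate an apple" / "ek seb khaya", and the rule pairs S: NP VP ↔ NP ne VP and VP: V NP ↔ NP V.]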
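The two rule pairs in the tree figure can be read as reordering rules applied to the English parse. The following is a minimal sketch under that reading, not the system's actual rule formalism: the tuple tree encoding, the `LEXICON` dictionary, and the `transfer` function are all hypothetical names introduced for illustration.

```python
# Hypothetical word-for-word lexicon taken from the example sentence pair.
LEXICON = {"John": "John", "ate": "khaya", "an": "ek", "apple": "seb"}

# English parse as nested tuples: (label, child, child, ...) or (label, word).
ENGLISH_TREE = (
    "S",
    ("NP", ("N", "John")),
    ("VP", ("V", "ate"), ("NP", ("Det", "an"), ("N", "apple"))),
)

def transfer(node):
    """Translate a parse tree bottom-up, reordering children per the rule pairs."""
    label, *children = node
    # Leaf: a part-of-speech node dominating a single word.
    if len(children) == 1 and isinstance(children[0], str):
        return [LEXICON[children[0]]]
    parts = [transfer(child) for child in children]
    child_labels = [child[0] for child in children]
    if label == "S" and child_labels == ["NP", "VP"]:
        # S: NP VP -> NP ne VP (insert the marker "ne" after the subject)
        parts = [parts[0], ["ne"], parts[1]]
    elif label == "VP" and child_labels == ["V", "NP"]:
        # VP: V NP -> NP V (verb-final order)
        parts = [parts[1], parts[0]]
    return [word for part in parts for word in part]

print(" ".join(transfer(ENGLISH_TREE)))
```

On the example tree this yields the target word order shown in the figure; a real rule learner would induce the two rules from aligned data rather than hard-code them.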
Some Kind of Results
SL: pu püchükeche awkantu y kiñe awkantun
TL: niños jugaron un juego
AL: ((1,1),(2,1),(3,2),(4,2),(5,3),(6,4))
Action 1: add (W1=los)
C_TL: los niños jugaron un juego
CAL: ((1,2),(2,2),(3,3),(4,3),(5,4),(6,5))
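The effect of the correction on the alignment can be checked mechanically: inserting a word at target position 1 shifts every link whose target index is at or after that position by one. This is a small sketch of that bookkeeping only; the function name `add_word` and the list-of-pairs representation are assumptions made for illustration.

```python
def add_word(target, alignment, position, word):
    """Insert `word` at 1-based `position` in the target and shift alignment links."""
    new_target = target[: position - 1] + [word] + target[position - 1 :]
    new_alignment = [
        (src, tgt + 1 if tgt >= position else tgt) for src, tgt in alignment
    ]
    return new_target, new_alignment

target = "niños jugaron un juego".split()
alignment = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 3), (6, 4)]

corrected_target, corrected_alignment = add_word(target, alignment, 1, "los")
print(" ".join(corrected_target))
print(corrected_alignment)
```

The result reproduces the corrected translation (C_TL) and corrected alignment (CAL) listed in the example.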
This needs to be in the form of trees for this poster
Monolingual Text
Morphologically Analyzed Text
Unsupervised Morphology Induction
Paradigms Organize Inflectional Morphology
English
Paradigm Cells    Inflection Class
                  ‘eat’     ‘silent-e’
Unmarked          eat       dance, erase, …
Present, 3rd      eats      dances, erases, …
Past Tense        ate       danced, erased, …
Progressive       eating    dancing, erasing, …
Passive           eaten     danced, erased, …
Paradigm Discovery in 3 Steps
2. Cluster Candidate Paradigms
3. Filter Unlikely Candidates
Spanish
Paradigm Cells      Inflection Class
                    ar      er      ir
1st, Sg, Present    o       o       o
2nd, Sg, Present    as      es      es
3rd, Sg, Present    a       e       e
1st, Pl, Present    amos    emos    imos
...                 …       …       …
Mapudungun (Non-Indo-European, Central Chile)
Suffix Slot        Affixes
Loc                pa, pu, Ø
Asp                tu, ka, Ø
Hab                ke, Ø
Mode               pe, Ø
Report             (ü)rke, Ø
Pol / Mood         la, ki, nu, Ø
Tense              a, fu, afu, Ø
Obj Agr            fi, Ø
Subj Agr / Mood    (ü)n, li, chi, yu, …
Results
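Morphemes Evaluated            Language   System      P      R      F1     σ(F1)
Inflectional & Derivational    English    ParaMor     48.9   53.6   51.1   0.8
Inflectional & Derivational    English    Morfessor   73.6   34.0   46.5   1.1
Inflectional & Derivational    German     ParaMor     60.0   33.5   43.0   0.7
Inflectional & Derivational    German     Morfessor   66.9   37.1   47.7   0.7
Inflectional Only              English    ParaMor     33.0   81.4   47.0   0.9
Inflectional Only              English    Morfessor   53.3   47.0   49.9   1.3
Inflectional Only              German     ParaMor     42.8   68.6   52.7   0.8
Inflectional Only              German     Morfessor   38.7   44.2   41.2   0.8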
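As a reading aid for the table, F1 is the harmonic mean of precision (P) and recall (R). The snippet below is a worked check on two rows, assuming the standard balanced F1 definition; small differences in the last digit are possible elsewhere in the table because the published P and R are themselves rounded.

```python
def f1(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# English, inflectional & derivational morphemes
print(round(f1(48.9, 53.6), 1))  # ParaMor
print(round(f1(73.6, 34.0), 1))  # Morfessor
```

ParaMor's lower precision but higher recall gives it the higher F1 on this row, which is the trade-off the table is summarizing.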
Segmentation Evaluation Methodology
Morpho Challenge 2007
Competition for unsupervised morphology induction algorithms
English
3rd Place Overall
Bested Morfessor (Creutz, 2006), a state-of-the-art unsupervised morphology induction algorithm
German
1st Place with Combined ParaMor-Morfessor System
Error analysis identified 2 major categories of incorrect candidates:
Small Candidates contain few affixes and cover few types.
Ø.ipo covers 8 words
Ø.e.iu covers 12 words
Incorrect Morpheme Boundary Candidates segment too far to the left.
iza.izado.izan.izar.izaron.izarán.izó
der.derá.dido.diendo.dieron.dió.día
16: a.aba.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.aría.ó
15: a.aba.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.ó
15: a.aba.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.aría.ó
15: a.aba.aban.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.ó
17: a.aba.aban.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.aría.ó
llega
1. Match the word to be segmented against the clustered affixes
2. Replace any matched affix with a new affix from the cluster
3. Segment the original word if the corpus contains the hypothesized word form
lleg aba, lleg aban, lleg ada, …
lleg +a
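The three segmentation steps can be sketched on the "llega" example. This is an illustrative approximation rather than ParaMor's implementation: the function name `segment`, the longest-affix-first matching order, and the toy corpus are assumptions, and only a single suffix boundary is hypothesized.

```python
def segment(word, cluster, corpus):
    """Return 'stem +affix' if swapping the affix yields an attested word, else the word."""
    # Step 1: match the word against the clustered affixes (longest first, an assumption).
    for affix in sorted(cluster, key=len, reverse=True):
        if not affix or not word.endswith(affix) or len(word) == len(affix):
            continue
        stem = word[: -len(affix)]
        # Step 2: replace the matched affix with each other affix from the cluster.
        for other in cluster:
            # Step 3: segment only if the corpus contains the hypothesized form.
            if other != affix and stem + other in corpus:
                return f"{stem} +{affix}"
    return word

cluster = {"a", "aba", "aban", "ada"}                    # subset of the clustered affixes
corpus = {"llega", "llegaba", "llegaban", "llegada", "mesa"}  # toy corpus, hypothetical

print(segment("llega", cluster, corpus))
print(segment("mesa", cluster, corpus))
```

Here "llega" is segmented because "llegaba" is attested, while "mesa" is left whole because no substituted form such as "mesaba" occurs in the toy corpus.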
1. Sample pairs of words that share morphemes.
Precision: Sample pairs sharing a morpheme in the automatic analyses.
Recall: Sample pairs from an answer key of morphologically analyzed words.
2. Examine corresponding analyses.
Precision: Count sampled pairs that share a morpheme in the answer key.
Recall: Count sampled pairs that share a morpheme in the automatic analyses.
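The pair-based methodology can be illustrated on a toy example. The sketch below enumerates all word pairs instead of sampling them, which is a simplification for a four-word vocabulary; the analyses, morpheme labels, and function names are hypothetical and are not taken from the Morpho Challenge data.

```python
from itertools import combinations

def sharing_pairs(analyses):
    """All word pairs whose analyses have at least one morpheme in common."""
    return [
        (a, b)
        for a, b in combinations(sorted(analyses), 2)
        if analyses[a] & analyses[b]
    ]

def pair_score(source, reference):
    """Fraction of morpheme-sharing pairs in `source` that also share one in `reference`."""
    pairs = sharing_pairs(source)
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if reference[a] & reference[b])
    return hits / len(pairs)

# Toy automatic analyses and answer key (morpheme labels need not match across the two).
automatic = {
    "llega": {"lleg", "+a"},
    "llegaba": {"lleg", "+aba"},
    "canta": {"cant", "+a"},
    "mesa": {"mes", "+a"},
}
answer_key = {
    "llega": {"llegar", "+3sg"},
    "llegaba": {"llegar", "+impf"},
    "canta": {"cantar", "+3sg"},
    "mesa": {"mesa"},
}

precision = pair_score(automatic, answer_key)  # pairs drawn from automatic analyses
recall = pair_score(answer_key, automatic)     # pairs drawn from the answer key
print(precision, recall)
```

Over-segmenting "mesa" as "mes +a" creates spurious shared-morpheme pairs, which lowers precision without affecting recall.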
Cross-linguistically, languages inflect using paradigms: sets of mutually exclusive cells. In a surface word form, exactly one cell from each paradigm is filled by an affix.
1. Search – Greedy bottom-up search through an empirical network of candidate partial paradigms. In the search figure, red candidate paradigms are active in the search.
2. Cluster – Hierarchical agglomerative clustering adapted to the peculiarities of partial paradigms.
3. Filter – Improve precision by removing unclustered and unlikely candidates.
Spanish data guided algorithm development and parameter adjustment.
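To make the clustering step concrete, here is a generic agglomerative merge over candidate suffix sets. It is a sketch under loud assumptions: plain Jaccard similarity and a fixed threshold of 0.5 stand in for the adapted similarity measure the poster refers to, and the function names are hypothetical. The inputs are candidate suffix sets that appear elsewhere on this poster.

```python
def jaccard(a, b):
    """Overlap of two affix sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)

def cluster(candidates, threshold=0.5):
    """Repeatedly merge (by set union) the most similar pair until none reaches the threshold."""
    clusters = [set(c) for c in candidates]
    while len(clusters) > 1:
        sim, i, j = max(
            (jaccard(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if sim < threshold:
            break
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

candidates = [
    "a.aba.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.aría.ó".split("."),
    "a.aba.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.ó".split("."),
    "a.aba.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.aría.ó".split("."),
    "a.aba.aban.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.ó".split("."),
    "e.er.erá.ido.ieron.ió".split("."),
]

for c in cluster(candidates):
    print(len(c), ".".join(sorted(c)))
```

With these inputs the four overlapping "ar" candidates merge into a single 17-affix cluster, while the "er" candidate shares no affixes with them and stays separate.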
1. Recall Centric Search
e.er.erá.ido.ieron.ió (28): deb, escog, ofrec, reconoc, vend, ...
e.ido.ieron.ir.irá.ió (28): asist, dirig, exig, ocurr, sufr, ...
e.erá.ido.ieron.ió (28): deb, escog, ...
e.er.ido.ieron.ió (46): deb, parec, recog, ...
e.ido.ieron.irá.ió (28): asist, dirig, ...
e.ido.ieron.ir.ió (39): asist, bat, sal, ...
e.er.erá.ieron.ió (32): deb, padec, romp, ...
e.ido.ieron.ió (86): asist, deb, hund, ...
e.erá.ieron.ió (32): deb, padec, ...
er.ido.ieron.ió (58): ascend, ejerc, recog, ...
ido.ieron.ir.ió (44): interrump, sal, ...
azar.e.ido.ieron.ir.ió (1): sal
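The network entries above pair a set of suffixes with the stems that take all of them, and adding a suffix can only shrink the stem set. The sketch below illustrates that structure on a toy vocabulary. The vocabulary is hypothetical, and the greedy criterion shown (move to the parent that keeps the most stems) is an assumption for illustration, not necessarily the criterion ParaMor uses.

```python
def stems_for(suffixes, vocabulary):
    """Stems that form an attested word with every suffix in the set."""
    stem_sets = [
        {w[: len(w) - len(s)] for w in vocabulary if w.endswith(s) and len(w) > len(s)}
        for s in suffixes
    ]
    return set.intersection(*stem_sets) if stem_sets else set()

def best_parent(suffixes, extra_suffixes, vocabulary):
    """Greedy upward step: add the one suffix that retains the most stems (assumed criterion)."""
    options = [s for s in extra_suffixes if s not in suffixes]
    return max(options, key=lambda s: len(stems_for(suffixes | {s}, vocabulary)))

# Toy vocabulary (hypothetical), built from stems that appear in the network above.
vocabulary = {
    "debe", "deber", "debido", "debieron", "debió",
    "asiste", "asistir", "asistido", "asistieron", "asistió",
    "sale", "salir", "salido", "salieron", "salió",
}

base = {"e", "ido", "ieron", "ió"}
print(sorted(stems_for(base, vocabulary)))
choice = best_parent(base, {"er", "ir"}, vocabulary)
print(choice, sorted(stems_for(base | {choice}, vocabulary)))
```

The same pattern is visible in the counts above: e.ido.ieron.ió covers 86 stems, while its parents e.er.ido.ieron.ió and e.ido.ieron.ir.ió cover 46 and 39.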
The Next Steps
Extend ParaMor to hypothesize more than one morpheme boundary per analysis
Expand beyond suffixation to other morphological phenomena, such as prefixation
Merge inflection classes of the same paradigm
Identify morphophonemic changes
A Closer Look at ParaMor vs. Morfessor