Linguistic Structure and Bilingual Informants Help Induce Machine Translation of Lesser-Resourced Languages

Christian Monson, Ariadna Font Llitjós, Vamshi Ambati, Lori Levin, Alon Lavie, Alison Alvarez, Roberto Aranovich, Jaime Carbonell, Robert Frederking, Erik Peterson, Katharina Probst

Paradigms Organize Inflectional Morphology


Page 1: Paradigms Organize Inflectional Morphology


Paradigms Organize Inflectional Morphology

Spanish

Paradigm Cells       Inflection Class
                     ar      er      ir
1st, Sg, Present     o       o       o
2nd, Sg, Present     as      es      es
3rd, Sg, Present     a       e       e
1st, Pl, Present     amos    emos    imos
...                  …       …       …

Mapudungun (Non-Indoeuropean, Central Chile)

Slot               Affixes
Hab                ke, Ø
Mode               pe, Ø
Report             (ü)rke, Ø
Pol / Mood         la, ki, nu, Ø
Tense              a, fu, afu, Ø
Obj Agr            fi, Ø
Subj Agr / Mood    (ü)n, li, chi, yu, …
Loc                pa, pu, Ø
Asp                tu, ka, Ø

Cross-linguistically, languages inflect using paradigms: sets of mutually exclusive morphological operations.

Paradigm Discovery in 3 Steps

1. Search – Greedy bottom-up search through an empirical network of candidate partial paradigms. Here, red candidate paradigms are active in the search.

2. Cluster – Hierarchical agglomerative clustering adapted to the peculiarities of partial paradigms

3. Filter – Improve precision by removing unclustered and unlikely candidates
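The Cluster step can be illustrated with a minimal sketch of generic agglomerative clustering over suffix sets. The Jaccard similarity, the 0.8 threshold, and the function names are assumptions made for illustration; the poster says only that the clustering is adapted to partial paradigms, so this is not the actual ParaMor procedure. The four Spanish ar-class candidates are taken from the poster, and the er-class candidate serves as a contrast.

```python
from itertools import combinations

def jaccard(a, b):
    """Overlap between two suffix sets (an assumed similarity measure)."""
    return len(a & b) / len(a | b)

def agglomerate(candidates, threshold=0.8):
    """Repeatedly merge the two most similar clusters until none is similar enough."""
    clusters = [frozenset(c) for c in candidates]
    while len(clusters) > 1:
        (i, j), score = max(
            (((i, j), jaccard(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if score < threshold:  # hypothetical stopping criterion
            break
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

candidates = [
    "a.aba.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.aría.ó".split("."),
    "a.aba.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.ó".split("."),
    "a.aba.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.aría.ó".split("."),
    "a.aba.aban.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.ó".split("."),
    "e.er.erá.ido.ieron.ió".split("."),
]
clusters = agglomerate(candidates)
for cluster in clusters:
    print(len(cluster), ".".join(sorted(cluster)))
```

Under these assumptions the four ar-class candidates merge into a single 17-suffix cluster, while the er-class candidate stays separate.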

Spanish data guided algorithm development and parameter adjustment

Results

Next (Thursday), block out:

1) Rule Learning

2) Rule Refinement

Then (Friday), fill in and update details

Finally (next week), make it look perfect

Morpho Challenge 2007

Competition for unsupervised morphology induction algorithms

English

3rd Place Overall

Bested Morfessor (Creutz, 2006), a state-of-the-art unsupervised morphology induction algorithm

German

1st Place with Combined ParaMor-Morfessor System

[Figure: paired parse trees for English "John ate an apple" and Hindi "John ne ek seb khaya". Recoverable node labels: S → NP VP in both languages; VP → V NP for English and VP → NP V for Hindi; the English object NP "an apple" is Det N.]

Some Kind of Results

SL: pu püchükeche awkantu y kiñe awkantun

TL: niños jugaron un juego

AL: ((1,1),(2,1)),(3,2),(4,2),(5,3),(6,4))

Action 1: add (W1=los)

C_TL: los niños jugaron un juego

CAL: ((1,2),(2,2)),(3,3),(4,3),(5,4),(6,5))
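The correction above inserts a word into the target sentence and renumbers the alignment links. A minimal sketch of that bookkeeping follows; the function name `add_word` and the list-of-pairs representation are illustrative assumptions, not the system's actual data structures.

```python
def add_word(target, alignment, position, word):
    """Insert `word` at 1-based `position` in the target and shift alignment links.

    `alignment` is a list of (source_index, target_index) pairs, both 1-based.
    Every link whose target index is at or after the insertion point moves right by one.
    """
    new_target = target[:position - 1] + [word] + target[position - 1:]
    new_alignment = [(s, t + 1 if t >= position else t) for s, t in alignment]
    return new_target, new_alignment

tl = "niños jugaron un juego".split()
al = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 3), (6, 4)]
c_tl, cal = add_word(tl, al, 1, "los")
print(" ".join(c_tl))
print(cal)
```

With the poster's example this reproduces the corrected sentence and the shifted links (1,2), (2,2), (3,3), (4,3), (5,4), (6,5).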

Some Kind of Results

This needs to be in the form of trees for this poster

Page 2: Paradigms Organize Inflectional Morphology


Monolingual Text

Morphologically Analyzed Text

Unsupervised Morphology Induction

Paradigms Organize Inflectional Morphology

English

Paradigm Cells    Inflection Class
                  'eat'     'silent-e'
Unmarked          eat       dance, erase, …
Present, 3rd      eats      dances, erases, …
Past Tense        ate       danced, erased, …
Progressive       eating    dancing, erasing, …
Passive           eaten     danced, erased, …

Paradigm Discovery in 3 Steps

2. Cluster Candidate Paradigms

3. Filter Unlikely Candidates


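Results

Evaluation                    Language   System      P      R      F1     σ(F1)
Inflectional & Derivational   English    ParaMor     48.9   53.6   51.1   0.8
Inflectional & Derivational   English    Morfessor   73.6   34.0   46.5   1.1
Inflectional & Derivational   German     ParaMor     60.0   33.5   43.0   0.7
Inflectional & Derivational   German     Morfessor   66.9   37.1   47.7   0.7
Inflectional Only             English    ParaMor     33.0   81.4   47.0   0.9
Inflectional Only             English    Morfessor   53.3   47.0   49.9   1.3
Inflectional Only             German     ParaMor     42.8   68.6   52.7   0.8
Inflectional Only             German     Morfessor   38.7   44.2   41.2   0.8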
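As a quick check on how the columns relate, the sketch below assumes F1 is the usual harmonic mean of precision and recall. That assumption reproduces the reported F1 values to within rounding of the published P and R figures. The helper name `f1` is illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)

# (P, R, reported F1) for the Inflectional & Derivational evaluation
reported = {
    ("English", "ParaMor"): (48.9, 53.6, 51.1),
    ("English", "Morfessor"): (73.6, 34.0, 46.5),
    ("German", "ParaMor"): (60.0, 33.5, 43.0),
    ("German", "Morfessor"): (66.9, 37.1, 47.7),
}
for (language, system), (p, r, expected) in reported.items():
    print(f"{language:8s}{system:10s} computed {f1(p, r):.2f}  reported {expected}")
```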

Segmentation

Evaluation Methodology


Error analysis identified 2 major categories of incorrect candidates

Small Candidates contain few affixes and cover few types.

Ø.ipo covers 8 words

Ø.e.iu covers 12 words

Incorrect Morpheme Boundary Candidates segment too far to the left.

iza.izado.izan.izar.izaron.izarán.izó

der.derá.dido.diendo.dieron.dió.día

16: a.aba.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.aría.ó

15: a.aba.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.ó

15: a.aba.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.aría.ó

15: a.aba.aban.ada.adas.ado.ados.an.ando.ar.aron.arse.ará.arán.ó

17: a.aba.aban.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.aría.ó

1. Match word to segment against clustered affixes

2. Replace any matched affix with new affix from cluster

3. Segment the original word, if the corpus contains the hypothesized word form

Example: llega

lleg aba, lleg aban, lleg ada, …

lleg +a
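The three segmentation steps can be sketched as follows. This is a minimal illustration, assuming the corpus is a plain set of word forms and the cluster a set of suffix strings. The function name `segment` and the tiny corpus are hypothetical.

```python
def segment(word, cluster, corpus):
    """Return (stem, suffix) splits of `word` licensed by a suffix cluster.

    Step 1: match a cluster suffix at the end of the word.
    Step 2: swap the matched suffix for each other suffix in the cluster.
    Step 3: keep the split if any swapped form is attested in the corpus.
    """
    analyses = []
    for suffix in sorted(cluster, key=len, reverse=True):
        if not word.endswith(suffix) or len(suffix) >= len(word):
            continue
        stem = word[: len(word) - len(suffix)]
        if any(stem + other in corpus for other in cluster if other != suffix):
            analyses.append((stem, suffix))
    return analyses

cluster = set("a.aba.aban.ada.adas.ado.ados.an.ando.ar.ara.aron.arse.ará.arán.aría.ó".split("."))
corpus = {"llega", "llegaba", "llegaban", "llegada"}  # toy corpus for illustration
analyses = segment("llega", cluster, corpus)
print(analyses)
```

For `llega` the only matching suffix is `a`, and the hypothesized forms `llegaba`, `llegaban`, and `llegada` are in the toy corpus, so the sketch yields the split lleg +a.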

1. Sample pairs of words that share morphemes.

Precision: Sample pairs sharing a morpheme in the automatic analyses

Recall: Sample pairs sharing a morpheme in an answer key of morphologically analyzed words

2. Examine corresponding analyses

Precision: Count sampled pairs that share a morpheme in the answer key

Recall: Count sampled pairs that share a morpheme in the automatic analyses
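The pair-based evaluation above can be sketched as one function applied in both directions. As a simplification, the sketch enumerates every word pair instead of sampling, and the toy analyses are invented for illustration; the names `share_morpheme` and `pair_agreement` are hypothetical.

```python
from itertools import combinations

def share_morpheme(analysis_a, analysis_b):
    """True if two morpheme lists have at least one morpheme in common."""
    return bool(set(analysis_a) & set(analysis_b))

def pair_agreement(source, reference):
    """Fraction of word pairs sharing a morpheme in `source` that also share one in `reference`."""
    words = sorted(set(source) & set(reference))
    pairs = [(a, b) for a, b in combinations(words, 2)
             if share_morpheme(source[a], source[b])]
    if not pairs:
        return 0.0
    hits = sum(share_morpheme(reference[a], reference[b]) for a, b in pairs)
    return hits / len(pairs)

# Toy data: automatic analyses versus an answer key (both made up).
predicted = {
    "llega": ["lleg", "+a"], "llegaba": ["lleg", "+aba"],
    "canta": ["cant", "+a"], "cantaba": ["canta", "+ba"],
}
gold = {
    "llega": ["lleg", "+a"], "llegaba": ["lleg", "+aba"],
    "canta": ["cant", "+a"], "cantaba": ["cant", "+aba"],
}
precision = pair_agreement(predicted, gold)  # pairs drawn from the automatic analyses
recall = pair_agreement(gold, predicted)     # pairs drawn from the answer key
print(precision, recall)
```

Drawing pairs from the automatic analyses and checking them against the answer key gives precision; swapping the two roles gives recall.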

Cross-linguistically, languages inflect using paradigms—sets of mutually exclusive cells. Exactly one cell from each paradigm can be filled (by an affix) in a surface word form.


1. Recall Centric Search

e.er.erá.ido.ieron.ió    28: deb, escog, ofrec, reconoc, vend, ...

e.ido.ieron.ir.irá.ió    28: asist, dirig, exig, ocurr, sufr, ...

e.erá.ido.ieron.ió    28: deb, escog, ...

e.er.ido.ieron.ió    46: deb, parec, recog, ...

e.ido.ieron.irá.ió    28: asist, dirig, ...

e.ido.ieron.ir.ió    39: asist, bat, sal, ...

e.er.erá.ieron.ió    32: deb, padec, romp, ...

e.ido.ieron.ió    86: asist, deb, hund, ...

e.erá.ieron.ió    32: deb, padec, ...

er.ido.ieron.ió    58: ascend, ejerc, recog, ...

ido.ieron.ir.ió    44: interrump, sal, ...

azar.e.ido.ieron.ir.ió    1: sal
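Each node in the network above pairs a suffix set with the stems that occur with every suffix in it, so adding a suffix can only shrink the stem set. A minimal sketch of a greedy upward walk through such a network follows. The stopping rule (`min_ratio`), the function names, and the toy vocabulary are assumptions for illustration, not ParaMor's actual search criteria.

```python
from collections import defaultdict

def build_index(vocabulary):
    """Map every candidate stem to the set of suffixes it occurs with."""
    index = defaultdict(set)
    for word in vocabulary:
        for cut in range(1, len(word)):
            index[word[:cut]].add(word[cut:])
    return index

def stems_for(suffixes, index):
    """Stems that occur with every suffix in the candidate."""
    return {stem for stem, seen in index.items() if suffixes <= seen}

def greedy_search(seed, index, min_ratio=0.5):
    """Grow a candidate from one suffix, always adding the suffix that keeps the most stems."""
    all_suffixes = set().union(*index.values())
    current = frozenset([seed])
    current_stems = stems_for(current, index)
    while True:
        best, best_stems = None, set()
        for suffix in sorted(all_suffixes - current):
            stems = stems_for(current | {suffix}, index)
            if len(stems) > len(best_stems):
                best, best_stems = current | {suffix}, stems
        # Hypothetical stopping rule: stop when too many stems would be lost.
        if best is None or len(best_stems) < min_ratio * len(current_stems):
            return current, current_stems
        current, current_stems = best, best_stems

vocabulary = ["debe", "deber", "debió", "vende", "vender", "vendió",
              "come", "comer", "sale", "salir"]  # toy vocabulary
index = build_index(vocabulary)
suffixes, stems = greedy_search("e", index)
print(".".join(sorted(suffixes)), len(stems), sorted(stems))
```

On the toy vocabulary the walk goes from `e` (4 stems) to `e.er` (3 stems) to `e.er.ió` (2 stems), mirroring how the stem counts drop along upward paths in the network shown above.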

The Next Steps

Extend ParaMor to hypothesize more than one morpheme boundary per analysis

Expand beyond suffixation to other morphological phenomena, prefixes, etc.

Merge inflection classes of the same paradigm

Identify morphophonemic changes

A Closer Look at ParaMor vs. Morfessor