Optimality Theoretic Learning of Lexical Borrowing Yulia Tsvetkov Waleed Ammar Chris Dyer

Optimality Theoretic Learning of Lexical Borrowing · MURI/Presentations/year4-2014-11-14/... · 2014/11/14


Page 1

Optimality Theoretic Learning of Lexical Borrowing

Yulia Tsvetkov, Waleed Ammar, Chris Dyer

Page 2

Resource-poor NLP

[Figure: annotations are projected from a source (src) sentence, e.g. "Book the flight …" tagged VB DT NN …, onto its target (tgt) translation]

Annotation projection:
1. via word alignments
2. via cross-lingual similarities
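As a rough illustration of annotation projection via word alignments, the sketch below copies each source tag to the target position aligned with it. The function name `project_tags`, the `UNK` default, and the example alignment are assumptions made for illustration; the slides do not specify an implementation.

```python
def project_tags(src_tags, alignment, tgt_len, default="UNK"):
    """Copy each source tag to the target position aligned with it.

    alignment is a list of (source_index, target_index) pairs;
    unaligned target positions keep the default tag.
    """
    tgt_tags = [default] * tgt_len
    for src_i, tgt_j in alignment:
        tgt_tags[tgt_j] = src_tags[src_i]
    return tgt_tags


src_tokens = ["Book", "the", "flight"]
src_tags = ["VB", "DT", "NN"]
# Hypothetical alignment to a two-word target sentence.
alignment = [(0, 0), (2, 1)]
print(project_tags(src_tags, alignment, tgt_len=2))
```

With a one-to-one alignment this is a plain copy; many-to-one links would need a tie-breaking rule, which this sketch leaves out.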

Page 3

Outline

1. Motivation: lexical borrowing as a source of cross-lingual lexical similarities

2. A constraint-based model of lexical borrowing for Arabic-Swahili

3. A model of lexical borrowing improves Swahili-English MT

*unpublished work, in preparation for NAACL’15

Page 4

Cross-lingual lexical similarities

Words that are orthographically or phonetically similar across different languages and are likely to be mutual translations

Page 5

Whence cross-lingual lexical similarities?

● Chance (unrelated, false friends)
  ○ insignificant number of words

Page 6

Whence cross-lingual lexical similarities?

● Foreign words (transliterations)

Core-periphery lexicon structure (Itô & Mester ‘95): Core, Periphery

English: New York
Yoruba: Niu Yoki
Swahili: New York
Russian: Нью-Йорк
Arabic: نیویورك

Page 7

Whence cross-lingual lexical similarities?

● Foreign words (transliterations)
  ○ proper names
  ○ specialized, peripheral vocabulary

Core, Periphery

English: New York
Yoruba: Niu Yoki
Swahili: New York
Russian: Нью-Йорк
Arabic: نیویورك

Page 8

Whence cross-lingual lexical similarities?

● Foreign words (transliterations)
● Genetically related words (cognates)
  ○ words in related languages inherited from one word in a common ancestral language
  ○ content words in core language lexicon

Core, Periphery

Latin: nocte
French: nuit
Spanish: noche
Italian: notte
Portuguese: noite
Romanian: noapte

Page 9

Whence cross-lingual lexical similarities?

● Foreign words (transliterations)
● Genetically related words (cognates)
● Borrowed words
  ○ frequent content words
  ○ of foreign origin, but aren’t perceived as foreign

Core, Periphery

Arabic: سكر
Arabic (transliterated): sukkar
Latin: zuccarum
French: sucre
German: Zucker
Italian: zucchero
English: sugar

Page 10

This work: Lexical borrowing

Adoption and nativization of words from another language (as a result of language contact)

● Foreign words (transliterations)
● Genetically related words (cognates)
● Borrowed words (loanwords)

Arabic: سكر
Arabic (transliterated): sukkar
Latin: zuccarum
French: sucre
German: Zucker
Italian: zucchero
English: sugar

Page 11

Borrowing is a fundamental research topic in linguistics

Yip ‘93 (Cantonese)

Davidson & Noyer ‘97 (Huave)

Jacobs & Gussenhoven ‘00

Kang ‘03 (Korean)

Kenstowicz & Suchato ‘06 (Thai)

Adler ‘06 (Hawaiian)

Rose & Demuth ‘06

Kenstowicz ‘07 (Fijian)

Schadeberg ‘09 (Swahili)

Mwita ‘09 (Swahili)

Hurskainen ‘04 (Swahili)

Adelaar ‘10 (Malagasy)

Kenstowicz ‘06 (Yoruba)

and many more...

Page 12

Prior work (in NLP)

Transliteration

Knight & Graehl ‘98
Al-Onaizan & Knight ‘02
Virga & Khudanpur ‘03
Klementiev & Roth ‘06
Tao et al. ‘06
Ravi & Knight ‘09
Ammar, Dyer & Smith ‘12

Cognates

Mann & Yarowsky ‘01
Kondrak ‘01
Kondrak, Marcu & Knight ‘03
Bouchard-Côté et al. ‘09
Hall & Klein ‘10

Borrowing

Page 13

Lexical borrowing graph (Haspelmath & Tadmor ‘09)

Persian: پلپل pilpil
Hebrew: פלפל falafel’
Arabic: فلافل falāfil
Swahili: pilipili
Gawwada: parpaare
Sanskrit: पिप्पली pippalī

Page 14

Borrowing is pervasive!

Resource-poor languages: # speakers; borrowed from resource-rich languages (% types)

● Swahili, Zulu, Malagasy, Hausa, Tarifit, Yoruba: 200 million; Arabic, Spanish, English, French (>40%)
● Japanese, Vietnamese, Korean, Cantonese, Thai: 400 million; Chinese, English (30-70%)
● Hindustani, Hindi, Urdu, Bengali, Persian, Pashto: 860 million; Arabic, English (>40%)

Total: 1.4 billion speakers

Page 15

Case study: Arabic-Swahili borrowing

Persian: پلپل pilpil
Hebrew: פלפל falafel’
Arabic: فلافل falāfil
Swahili: pilipili
Gawwada: parpaare
Sanskrit: पिप्पली pippalī

Page 16

Arabic-Swahili borrowing: history

● 800 A.D.-1920: Indian Ocean trading
● Influence of Islam
● ~40% of Swahili types are borrowed from Arabic*

*from Standard Swahili-English dictionary (Johnson ‘39)

Page 17

Arabic-Swahili borrowing: examples

Each example gives the English gloss, the Arabic (Semitic) donor, the Swahili (Bantu) loanword, and the phonological & morphological integration involved.

fever: Arabic حمى ḥummat → Swahili homa
* syllable structure adaptation: CV, CVV, CVC, CVCC → V, CV
* degemination - Swahili does not allow consonant clusters
* vowel substitution

minister: Arabic الوزیر Alwzyr → Swahili kiuwaziri
* Arabic morphology (optionally) drops
* Swahili morphology is applied
* vowel epenthesis to keep syllables open
* vowel substitution

palace: Arabic القصر AlqSr → Swahili kasiri
* consonant adaptation: /tˤ/→/t/, /dˤ/→/d/, /θ/→/s/, /x/→/k/, etc.
* vowel epenthesis
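The consonant adaptation listed above can be sketched as a lookup over a sequence of phones. Only the four mappings named on the slide are included; the name `adapt_consonants`, the list-of-phones representation, and the made-up example sequence are assumptions for illustration.

```python
# Mappings taken from the slide; the "etc." there means this table is partial.
CONSONANT_MAP = {"tˤ": "t", "dˤ": "d", "θ": "s", "x": "k"}


def adapt_consonants(phones):
    """Replace phones by their adapted counterpart; others pass through unchanged."""
    return [CONSONANT_MAP.get(p, p) for p in phones]


# Made-up phone sequence, not a real Arabic word.
print(adapt_consonants(["tˤ", "a", "x", "i"]))
```

Phones are kept as list items rather than characters because symbols such as /tˤ/ span more than one code point.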

Page 18

Arabic-Swahili borrowing: our research goals

1. Given a Swahili vocabulary and an Arabic vocabulary, identify plausible donor-loanword candidates

2. Produce a ranked list of candidate donor-loanword pairs

3. Augment Swahili-English MT using Arabic-Swahili borrowing model

Page 19

Arabic-Swahili borrowing model

Arabic → to IPA → Generate loanword candidates → Rank loanword candidates → from IPA → Swahili

1. Convert letters to phones (rule-based)
2. Generate loanword candidates (rule-based)
3. Rank loanword candidates (learned)

Page 20

Arabic-Swahili borrowing model: from orthographic to phonetic space

Arabic → to IPA → Generate loanword candidates → Rank loanword candidates → from IPA → Swahili

1. Convert letters to phones

to IPA: كتابا (book.sg.indef) → kuttaba, kitaba, ...
from IPA: kitabu → kitabu

Page 21

Arabic-Swahili borrowing model: generating candidate loanwords

Arabic → to IPA → Generate loanword candidates (syllabification, morphological adaptation, phonological adaptation) → Rank loanword candidates → from IPA → Swahili

2. Adapt Arabic words to Swahili syllable structure, morphology and phonology (Polomé ‘67; Zawawi ‘79; Schadeberg ‘09; Mwita ‘09)

to IPA: كتابا (book.sg.indef) → kuttaba, kitaba, ...
from IPA: kitabu → kitabu

Page 22

Arabic-Swahili borrowing model: generating candidate loanwords

2. Adapt Arabic words to Swahili syllable structure, morphology and phonology

to IPA: كتابا (book.sg.indef) → kuttaba, kitaba, ...

Arabic affix removal: kuttaba, kuttab, kitaba, kitab, ...

Syllabification: ku.tta.ba., ku.t.ta.ba., ..., ki.ta.ba., ki.ta.b.

Arabic-to-Swahili phonological adaptation:
  ku.ta.ba. [degemination]
  ku.tata.ba. [epenthesis]
  ku.ta.bu. [final vowel subst.]
  ki.ta.bu. [final vowel subst.]
  ki.ta.bu. [epenthesis]

Swahili morphological adaptation: ku.tata.ba., li.ku.tata.ba., vi.ki.ta.bu., ki.ta.bu., ki.ta.bu.

from IPA: kitabu → kitabu

(Littell, Price & Levin ‘14)
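The candidate-generation step can be approximated with a few string rewrites. The sketch below is a simplification under stated assumptions: it works on unsyllabified romanized forms, covers only degemination, final vowel substitution, and final vowel epenthesis, and uses illustrative names (`degeminate`, `final_vowel_variants`, `gen`) that do not come from the slides.

```python
import re

VOWELS = "aeiou"
FINAL_VOWELS = "uoie"  # final-vowel set listed for Swahili adaptation


def degeminate(form):
    """Shorten geminate consonants, e.g. 'kuttaba' -> 'kutaba'."""
    return re.sub(r"([^aeiou])\1", r"\1", form)


def final_vowel_variants(form):
    """Substitute a final vowel, or add one after a final consonant."""
    if form[-1] in VOWELS:
        return {form[:-1] + v for v in FINAL_VOWELS}  # final vowel substitution
    return {form + v for v in FINAL_VOWELS}           # final vowel epenthesis


def gen(donor_forms):
    """Over-generate candidate loanword forms from donor forms."""
    candidates = set()
    for form in donor_forms:
        for base in {form, degeminate(form)}:
            candidates.add(base)
            candidates |= final_vowel_variants(base)
    return candidates


cands = gen(["kuttaba", "kitab"])
print("kitabu" in cands, len(cands))
```

Over-generation is deliberate here: the ranking step is expected to sort out which of the candidates is the plausible loanword.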

Page 23

Arabic-Swahili borrowing model: learning candidate ranking

Arabic → to IPA → Generate loanword candidates (syllabification, morphological adaptation, phonological adaptation) → Ranking with Optimality Theory constraints → from IPA → Swahili

3. Produce a ranked list of candidate loanwords

to IPA: كتابا (book.sg.indef) → kuttaba, kitaba, ...
Candidates: ku.tata.ba., li.ku.tata.ba., vi.ki.ta.bu., ki.ta.bu., ki.ta.bu., ...
from IPA: kitabu → kitabu

Page 24

Optimality Theory (Prince & Smolensky ‘08; McCarthy ‘09)

underlying (donor) form → pronounced forms (loanword candidates) → optimal (loanword) form

● language-universal constraints*
● constraints ranked differently in donor and recipient languages

*competing, violable

Page 25

Optimality Theory constraints

Faithfulness Constraints

MAX-IO-MORPH: no (donor) affix deletion
MAX-IO-C: no consonant deletion
MAX-IO-V: no vowel deletion

DEP-IO-MORPH: no (recipient) affix epenthesis
DEP-IO-V: no vowel epenthesis

IDENT-IO-P: no pharyngeal consonant substitution
IDENT-IO-G: no glottal consonant substitution
IDENT-IO-E: no emphatic consonant substitution
IDENT-IO-C: no consonant substitution
IDENT-IO-F: no final vowel substitution
IDENT-IO-V: no vowel substitution

Faithfulness constraints impose input-output correspondence

Page 26

Optimality Theory constraints

Markedness Constraints

NO-CODA: syllables must not have a coda
ONSET: syllables must have onsets
PEAK: there is only one syllabic peak
SSP: complex onsets rise in sonority
*COMPLEX-S: no consonant clusters on syllable margins
*COMPLEX-C: no consonant clusters within a syllable
*COMPLEX-V: no vowel clusters

Markedness constraints impose output well-formedness
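To make the markedness constraints concrete, the sketch below counts violations of three of them on a dot-syllabified candidate. It is a simplified reading: the three *COMPLEX variants are collapsed into one check for adjacent consonants inside a syllable, vowels are assumed to be a, e, i, o, u, and the function names are illustrative rather than taken from the slides.

```python
import re

VOWELS = set("aeiou")


def syllables(form):
    """Split a dot-delimited form such as 'ku.tta.ba.' into syllables."""
    return [s for s in form.split(".") if s]


def markedness_violations(form):
    """Count NO-CODA, ONSET and (collapsed) *COMPLEX violations per syllable."""
    counts = {"NO-CODA": 0, "ONSET": 0, "*COMPLEX": 0}
    for syl in syllables(form):
        if syl[-1] not in VOWELS:           # syllable ends in a consonant
            counts["NO-CODA"] += 1
        if syl[0] in VOWELS:                # syllable starts with a vowel
            counts["ONSET"] += 1
        if re.search(r"[^aeiou]{2,}", syl):  # consonant cluster in the syllable
            counts["*COMPLEX"] += 1
    return counts


for cand in ["ku.tta.ba.", "ki.ta.b.", "ki.ta.bu."]:
    print(cand, markedness_violations(cand))
```

A candidate made only of open CV syllables, such as ki.ta.bu., incurs no violations under these three checks.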

Page 27

Arabic-Swahili borrowing model: learning candidate ranking

Arabic → to IPA → Generate loanword candidates (syllabification, morphological adaptation, phonological adaptation) → Ranking with Optimality Theory constraints → from IPA → Swahili

3. Produce a ranked list of candidate loanwords

to IPA: كتابا (book.sg.indef) → kuttaba, kitaba, ...
Candidates: ku.tata.ba., li.ku.tata.ba., vi.ki.ta.bu., ki.ta.bu., ki.ta.bu.
from IPA: kitabu → kitabu

Page 28

Arabic-Swahili borrowing model: learning candidate ranking

Arabic → to IPA → Generate loanword candidates (syllabification, morphological adaptation, phonological adaptation) → Ranking with Optimality Theory constraints → from IPA → Swahili

3. Produce a ranked list of candidate loanwords

to IPA: كتابا (book.sg.indef) → kuttaba, kitaba, ...
Candidates: ku.tata.ba., li.ku.tata.ba., ku.tta.ba., ki.ta.bu., ki.ta.bu.

Candidates annotated with constraint violations:
  ku.ta<DEP-V>ta<PEAK>.ba.
  li<DEP-MORPH>.ku.ta<DEP-V>ta<PEAK>.ba.
  li.ku.tta<*COMPLEX>.ba.
  ki.ta.bu<IDENT-IO-V>.
  ki.ta.bu<DEP-V>.

from IPA: kitabu → kitabu
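One way to turn violation annotations like these into a ranking is a weighted sum of violation counts, which is closer to Harmonic Grammar than to strict OT ranking. The sketch below assumes that scoring form; the weights are invented for illustration and are not the learned values from this work.

```python
def score(violations, weights):
    """Negative weighted sum of violation counts (higher is better)."""
    return -sum(weights.get(c, 0.0) * n for c, n in violations.items())


def rank(candidates, weights):
    """Return candidate forms sorted from best to worst score."""
    return sorted(candidates, key=lambda f: score(candidates[f], weights), reverse=True)


# Violation counts read off three of the annotated candidates.
candidates = {
    "li.ku.tata.ba.": {"DEP-MORPH": 1, "DEP-V": 1, "PEAK": 1},
    "ku.tta.ba.": {"*COMPLEX": 1},
    "ki.ta.bu.": {"IDENT-IO-V": 1},
}
# Hypothetical constraint weights.
weights = {"*COMPLEX": 3.0, "PEAK": 2.0, "DEP-MORPH": 1.0, "DEP-V": 1.0, "IDENT-IO-V": 0.5}

print(rank(candidates, weights))
```

With these made-up weights the cheap vowel substitution in ki.ta.bu. outranks the cluster in ku.tta.ba.; different weights would give a different order, which is why the weights are learned.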

Page 29

Arabic-Swahili borrowing model

Arabic words → Donor words to IPA → GEN → EVAL → IPA to Recipient words → Swahili words

GEN: Generate plausible Swahili phonetic forms (syllabification, morphological adaptation, phonological adaptation)

EVAL: Re-rank loanword candidates to promote input-output correspondence and output well-formedness (ranking with Optimality Theory constraints)

● Unweighted insertion/deletion/substitution transducers
● Weighted identity transducers

Page 30

Arabic-Swahili borrowing model: learning constraint weights

1. Extract a small training set from Arabic-English and English-Swahili parallel corpora based on phonetic and semantic similarity (cf. Kondrak ‘01, cognate identification)

2. Expand the extracted training set using Arabic morph. analyzer

3. Learn OT constraint weights using Machine Learning

Training: 417 examples
Test: 73 examples (15%), manually verified by a native Arabic speaker and using a Swahili-English dictionary
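The slides do not say which learner is used for step 3. Purely as an illustration, the sketch below uses a perceptron-style update over violation counts, with weights clamped at zero so that violations are never rewarded. Every name, the update rule, and the single training example are assumptions.

```python
def score(violations, weights):
    """Negative weighted sum of violation counts (higher is better)."""
    return -sum(weights.get(c, 0.0) * n for c, n in violations.items())


def perceptron_train(examples, constraints, epochs=10, lr=1.0):
    """examples: list of (candidates, gold), candidates maps form -> violation counts."""
    weights = {c: 0.0 for c in constraints}
    for _ in range(epochs):
        for candidates, gold in examples:
            pred = max(candidates, key=lambda f: score(candidates[f], weights))
            if pred == gold:
                continue
            for c in constraints:
                # Raise weights of constraints the wrong winner violates more
                # than the gold form; lower the others, but never below zero.
                delta = candidates[pred].get(c, 0) - candidates[gold].get(c, 0)
                weights[c] = max(0.0, weights[c] + lr * delta)
    return weights


examples = [
    ({"ku.tta.ba.": {"*COMPLEX": 1}, "ki.ta.bu.": {"IDENT-IO-V": 1}}, "ki.ta.bu."),
]
learned = perceptron_train(examples, ["*COMPLEX", "IDENT-IO-V"])
print(learned)
```

After training on this one example, the cluster constraint carries more weight than the vowel-substitution constraint, so the attested form wins.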

Page 31

Arabic-Swahili borrowing model: evaluation

1. Model design

                                             Dev    Test
Reachability (%)                             75     88
Ambiguity (avg. candidates per input word)   885    857

Ambiguity baseline: 787,000

2. Model accuracy

                                                           Accuracy (%)
Orthographic  Levenshtein                                  8.9
Orthographic  CRF (transliteration, Ammar et al. ‘12)      16.4
Phonetic      Levenshtein                                  19.8
Phonetic      Levenshtein-H (cognate, Mann & Yarowsky ‘01) 19.7
OT            uniform constraint weights                   29.3
OT            learned constraint weights                   52.0

3. Qualitative evaluation: OT constraint ranking is consistent with linguistic accounts
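For orientation, a plain Levenshtein baseline of the kind listed in the accuracy table can be sketched as below: pick the donor whose form has the smallest edit distance to the loanword. The romanized vocabulary is hypothetical, and the Levenshtein-H and CRF baselines are not reproduced here.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def closest(word, vocabulary):
    """Return the vocabulary entry with the smallest edit distance to word."""
    return min(vocabulary, key=lambda v: levenshtein(word, v))


# Hypothetical romanized donor vocabulary.
print(closest("kitabu", ["kuttaba", "kitaba", "safari"]))
```

Such a baseline treats every edit as equally costly, whereas the constraint-based model distinguishes, for example, a final vowel substitution from a consonant cluster.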

Page 32

Arabic-Swahili borrowing: research goals

1. Given a Swahili vocabulary and an Arabic vocabulary, identify plausible donor-loanword candidates

2. Produce a ranked list of candidate donor-loanword pairs

3. Augment Swahili-English MT using Arabic-Swahili borrowing model

Page 33

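MT experiments

Arabic-English MT (AR → EN): resource-rich, 5.5M sentences
Swahili-English MT (SW → EN): low-resource, 14K sentences, 5K OOV types (7.5%)

Swahili words that are OOV for the Swahili-English system (??? in the English output) go through the borrowing model, and the Arabic side supplies translation candidates:

safari: ysAfr یسافر (travel)
kituruki: trky تركي (turkish)

BLEU
Baseline: 18.0
+ OOV loanwords: 18.5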
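The lookup chain in this setup (Swahili OOV word, proposed Arabic donor, English translation of the donor) can be sketched with two dictionaries. The entries come from the slide's two examples; the function name and data structures are assumptions, and how the candidates are fed into the MT system is not specified in the slides.

```python
def translate_oov(sw_word, donor_candidates, ar_en_lexicon):
    """Collect English translations of the Arabic donors proposed for a Swahili OOV."""
    translations = []
    for donor in donor_candidates.get(sw_word, []):
        for en in ar_en_lexicon.get(donor, []):
            if en not in translations:  # keep first occurrence, preserve order
                translations.append(en)
    return translations


# Borrowing-model output and Arabic-English lexicon, from the slide's examples.
donor_candidates = {"safari": ["ysAfr"], "kituruki": ["trky"]}
ar_en_lexicon = {"ysAfr": ["travel"], "trky": ["turkish"]}

print(translate_oov("safari", donor_candidates, ar_en_lexicon))
print(translate_oov("kituruki", donor_candidates, ar_en_lexicon))
```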

Page 34

Summary of contributions

1. First study on lexical borrowing in NLP

2. First study that operationalizes Optimality Theory in a downstream task

3. Swahili-English MT improvement

Page 35

Future work

1. More languages

2. More MT experiments

3. Core NLP tasks: cross-lingual part-of-speech tagging

Page 36

Swahili: shukuru
Arabic: shukran (شكرا)
English: thank you

Page 37

Arabic-Swahili borrowing statistics*

Loanwords (% within sem. field)

Semantic field     Total   Arabic   English   Other
MODERN WORLD       73.6    15.1     43.7      14.8
RELIGION           55.7    47.5     -         9.2
LAW                54.6    41.1     9.4       4.1
POSSESSION         48.1    41.4     1.9       4.9
SOCIO-POLITICAL    47.5    37.9     -         9.6
EMOTIONS           46.8    39       1.6       6.2
COGNITION          46      40.6     1.5       3.9
CLOTHING           43.4    11.1     18.8      13.5
THE HOUSE          37.5    19.3     6.6       11.7

nouns: 19%
adjectives: 19%
verbs: 15%
adverbs: 14%
func. words: 15%

*a study on 1,460 core words (Schadeberg ‘09)

Page 38

http://blog.oxforddictionaries.com/2014/08/which-everyday-english-words-came-from-arabic/

Page 39

Arabic-Swahili borrowing model

ARABIC donor words → Donor words to IPA → GEN → EVAL → IPA to Recipient words → SWAHILI loanwords

Donor words to IPA: كتابا (book.sg.indef) → kuttaba, kitaba, ...

GEN

Donor affix removal: kuttaba, kuttab, kitaba, kitab, ...

Syllabification: ku.tta.ba., ku.t.ta.ba., ..., ki.ta.ba., ki.ta.b., ...

Donor-to-Recipient phonological adaptation:
  ku.ta.ba. [degemination]
  ku.tata.ba. [epenthesis]
  ku.ta.bu. [final vowel subst.]
  ki.ta.bu. [final vowel subst.]
  ki.ta.bu. [epenthesis]
  ...

Recipient morphological adaptation: ku.tata.ba., li.ku.tata.ba., vi.ki.ta.bu., ki.ta.bu., ki.ta.bu., ...

EVAL

Ranking with Optimality Theory constraints:
  ku.ta<DEP-V>ta<PEAK>.ba.
  li<DEP-MORPH>.ku.ta<DEP-V>ta<PEAK>.ba.
  li.ku.tta<*COMPLEX>.ba.
  ki.ta.bu<IDENT-IO-V>.
  ki.ta.bu<DEP-V>.
  vi<DEP-MORPH>.ki.ta.bu<IDENT-IO-V>.

IPA to Recipient words: kitabu → kitabu

Page 40

Arabic-Swahili morphophonological adaptation

● Syllable structure: CV, CVV, CVC, CVCC → V, CV

● Morphology
  ○ Arabic affixes deletion (optional)
  ○ Swahili affixes concatenation

● Phonology
  ○ Vowel deletion – shortening of Arabic long vowels and vowel clusters
  ○ Consonant degemination – shortening of Arabic geminate consonants
  ○ Substitution of similar phones – /tˤ/→/t/, /dˤ/→/d/, /θ/→/s/, /x/→/k/, etc.
  ○ Vowel epenthesis – eliminating Arabic codas and consonant clusters
  ○ Final vowel substitution – /u/, /o/, /i/, /e/