Optimality Theoretic Learning of Lexical Borrowing
Yulia Tsvetkov, Waleed Ammar, Chris Dyer
Resource-poor NLP
Annotation projection:
1. via word alignments
2. via cross-lingual similarities
[Figure: annotations are projected from a source (src) sentence, e.g. "Book the flight …" tagged VB DT NN …, onto a target (tgt) sentence]
Outline
1. Motivation: lexical borrowing as a source of cross-lingual lexical similarities
2. A constraint-based model of lexical borrowing for Arabic-Swahili
3. A model of lexical borrowing improves Swahili-English MT
*unpublished work, in preparation for NAACL’15
Cross-lingual lexical similarities
Words that are orthographically or phonetically similar across different languages and are likely to be mutual translations
Whence cross-lingual lexical similarities?
● Chance (unrelated, false friends)
○ an insignificant number of words
Whence cross-lingual lexical similarities?
● Foreign words (transliterations)
○ proper names
○ specialized, peripheral vocabulary
[Figure: core-periphery lexicon structure (Itô & Mester ‘95)]
Example: English New York; Yoruba Niu Yoki; Swahili New York; Russian Нью-Йорк; Arabic نیویورك
Whence cross-lingual lexical similarities?
● Foreign words (transliterations)
● Genetically related words (cognates)
○ words in related languages inherited from one word in a common ancestral language
○ content words in core language lexicon
[Figure: core-periphery lexicon structure]
Example: Latin nocte; French nuit; Spanish noche; Italian notte; Portuguese noite; Romanian noapte
Whence cross-lingual lexical similarities?
● Foreign words (transliterations)
● Genetically related words (cognates)
● Borrowed words
○ frequent content words
○ of foreign origin, but not perceived as foreign
[Figure: core-periphery lexicon structure]
Example: Arabic سكر (transliterated: sukkar); Latin zuccarum; French sucre; German Zucker; Italian zucchero; English sugar
This work: Lexical borrowing
● Foreign words (transliterations)
● Genetically related words (cognates)
● Borrowed words (loanwords)
Lexical borrowing: adoption and nativization of words from another language (as a result of language contact)
Example: Arabic سكر (transliterated: sukkar); Latin zuccarum; French sucre; German Zucker; Italian zucchero; English sugar
Borrowing is a fundamental research topic in linguistics
Yip ‘93 (Cantonese)
Davidson & Noyer ‘97 (Huave)
Jacobs & Gussenhoven ‘00
Kang ‘03 (Korean)
Kenstowicz & Suchato ‘06 (Thai)
Adler ‘06 (Hawaiian)
Rose & Demuth ‘06
Kenstowicz ‘07 (Fijian)
Schadeberg ‘09 (Swahili)
Mwita ‘09 (Swahili)
Hurskainen ‘04 (Swahili)
Adelaar ‘10 (Malagasy)
Kenstowicz ‘06 (Yoruba)
and many more...
Prior work (in NLP)
Transliteration: Knight & Graehl ‘98; Al-Onaizan & Knight ‘02; Virga & Khudanpur ‘03; Klementiev & Roth ‘06; Tao et al. ‘06; Ravi & Knight ‘09; Ammar, Dyer & Smith ‘12
Cognates: Mann & Yarowsky ‘01; Kondrak ‘01; Kondrak, Marcu & Knight ‘03; Bouchard-Côté et al. ‘09; Hall & Klein ‘10
Borrowing: ✘
Lexical borrowing graph
Persian: پلپل pilpil
Hebrew: פלפל falafel’
Arabic: فلافل falāfil
Swahili: pilipili
Gawwada: parpaare
Sanskrit: पिप्पली pippalī
Haspelmath & Tadmor ‘09
Borrowing is pervasive!
Resource-poor languages | # speakers | Borrowed from resource-rich (% types)
Swahili, Zulu, Malagasy, Hausa, Tarifit, Yoruba | 200 million | Arabic, Spanish, English, French (>40%)
Japanese, Vietnamese, Korean, Cantonese, Thai | 400 million | Chinese, English (30-70%)
Hindustani, Hindi, Urdu, Bengali, Persian, Pashto | 860 million | Arabic, English (>40%)
Total | 1.4 billion |
Case study: Arabic-Swahili borrowing
Arabic-Swahili borrowing: history
● 800 A.D.-1920: Indian Ocean trading
● Influence of Islam
● ~40% of Swahili types are borrowed from Arabic*
*from Standard Swahili-English dictionary (Johnson ‘39)
Arabic-Swahili borrowing: examples
Phonological & morphological integration
English | Arabic (Semitic) | Swahili (Bantu)
fever | حمى ḥummat | homa
minister | الوزیر Alwzyr | kiuwaziri
palace | القصر AlqSr | kasiri
homa (fever):
* syllable structure adaptation: CV, CVV, CVC, CVCC → V, CV
* degemination: Swahili does not allow consonant clusters
* vowel substitution
kiuwaziri (minister):
* Arabic morphology is (optionally) dropped
* Swahili morphology is applied
* vowel epenthesis to keep syllables open
* vowel substitution
kasiri (palace):
* consonant adaptation: /tˤ/→/t/, /dˤ/→/d/, /θ/→/s/, /x/→/k/, etc.
* vowel epenthesis
Arabic-Swahili borrowing: our research goals
1. Given a Swahili vocabulary and an Arabic vocabulary, identify plausible donor-loanword candidates
2. Produce a ranked list of candidate donor-loanword pairs
3. Augment Swahili-English MT using Arabic-Swahili borrowing model
Arabic-Swahili borrowing model
Pipeline: Arabic → to IPA → Generate loanword candidates → Rank loanword candidates → from IPA → Swahili
1. Convert letters to phones
2. Generate loanword candidates
3. Rank loanword candidates
[Figure labels: rule-based; learned]
Arabic-Swahili borrowing model: from orthographic to phonetic space
1. Convert letters to phones
Example (book.sg.indef): Arabic كتابا → to IPA: kuttaba, kitaba, ...
IPA kitabu → from IPA: Swahili kitabu
Arabic-Swahili borrowing model: generating candidate loanwords
2. Adapt Arabic words to Swahili syllable structure, morphology and phonology (Polomé ‘67; Zawawi ‘79; Schadeberg ‘09; Mwita ‘09)
Example (book.sg.indef): Arabic كتابا → Swahili kitabu
to IPA: kuttaba, kitaba, ...
Arabic affix removal: kuttaba, kuttab, kitaba, kitab, ...
Syllabification: ku.tta.ba., ku.t.ta.ba., ..., ki.ta.ba., ki.ta.b.
Arabic-to-Swahili phonological adaptation:
ku.ta.ba. [degemination]
ku.tata.ba. [epenthesis]
ku.ta.bu. [final vowel subst.]
ki.ta.bu. [final vowel subst.]
ki.ta.bu. [epenthesis]
Swahili morphological adaptation: ku.tata.ba.li.ku.tata.ba.vi.ki.ta.bu. ki.ta.bu.ki.ta.bu.
from IPA: kitabu
(Littell, Price & Levin ‘14)
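The phonological adaptation steps named on this slide (degemination, epenthesis, final vowel substitution) can be sketched with plain string functions. This is a minimal illustration under stated assumptions, not the implementation behind the slides: the function names, the five-vowel inventory, the default epenthetic vowel and the one-character-per-phone encoding are all hypothetical choices made for the example.

```python
VOWELS = set("aeiou")  # assumed vowel inventory for this sketch

def degeminate(word):
    """Shorten doubled consonants, e.g. kuttaba -> kutaba."""
    out = []
    for ph in word:
        if out and out[-1] == ph and ph not in VOWELS:
            continue  # skip the second half of a geminate
        out.append(ph)
    return "".join(out)

def epenthesize(word, vowel="u"):
    """Insert a vowel after every consonant that is not followed by a vowel."""
    out = []
    for i, ph in enumerate(word):
        out.append(ph)
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ph not in VOWELS and nxt not in VOWELS:
            out.append(vowel)  # opens a closed syllable or breaks a cluster
    return "".join(out)

def substitute_final_vowel(word, finals="uoie"):
    """Return variants of the word with its final vowel replaced."""
    if word and word[-1] in VOWELS:
        return {word[:-1] + v for v in finals}
    return set()

def gen(ipa):
    """Over-generate loanword candidates from one donor form."""
    candidates = {ipa}
    for adapt in (degeminate, epenthesize):
        candidates |= {adapt(c) for c in candidates}
    for c in list(candidates):
        candidates |= substitute_final_vowel(c)
    return candidates

print(sorted(gen("kitab")))
```

Over-generation is deliberate here: the sketch produces many implausible forms and leaves it to a later ranking step to choose among them.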
Arabic-Swahili borrowing model: learning candidate ranking
3. Produce a ranked list of candidate loanwords
Stages between "to IPA" and "from IPA": syllabification, morphological adaptation, phonological adaptation, then ranking with Optimality Theory constraints
Example (book.sg.indef): Arabic كتابا → to IPA: kuttaba, kitaba, ...
Candidates: ku.tata.ba.li.ku.tata.ba.vi.ki.ta.bu. ki.ta.bu.ki.ta.bu....
IPA kitabu → from IPA: Swahili kitabu
Optimality Theory
underlying (donor) form → pronounced forms (loanword candidates) → optimal (loanword) form
language-universal constraints*
*competing, violable constraints, ranked differently in donor and recipient languages
Prince & Smolensky ‘08; McCarthy ‘09
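Classical OT evaluation under a strict constraint ranking can be sketched as a lexicographic comparison of violation profiles. The violation marks below are taken from the annotated candidates shown elsewhere in these slides; the two rankings and the names `eval_ot`, `recipient_like` and `donor_like` are hypothetical. The model in this talk learns constraint weights rather than a strict ranking, so this is an illustration of the theory, not of the model.

```python
def eval_ot(tableau, ranking):
    """Pick the optimal candidate under a strict constraint ranking.

    tableau: {candidate: {constraint: number of violations}}
    ranking: constraints ordered from highest- to lowest-ranked
    """
    def profile(candidate):
        marks = tableau[candidate]
        return tuple(marks.get(c, 0) for c in ranking)
    # Tuples compare lexicographically, so one violation of a higher-ranked
    # constraint outweighs any number of violations of lower-ranked ones.
    return min(tableau, key=profile)

tableau = {
    "ku.tta.ba.": {"*COMPLEX": 1},
    "ku.tata.ba.": {"DEP-V": 1, "PEAK": 1},
    "ki.ta.bu.": {"IDENT-IO-V": 1},
}

# Hypothetical rankings: markedness on top vs. faithfulness on top.
recipient_like = ["*COMPLEX", "PEAK", "DEP-V", "IDENT-IO-V"]
donor_like = ["IDENT-IO-V", "DEP-V", "PEAK", "*COMPLEX"]

print(eval_ot(tableau, recipient_like))
print(eval_ot(tableau, donor_like))
```

The same candidate set yields different winners under the two rankings, which is the sense in which constraints are "ranked differently in donor and recipient languages".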
Optimality Theory constraints
Faithfulness Constraints
MAX-IO-MORPH | no (donor) affix deletion
MAX-IO-C | no consonant deletion
MAX-IO-V | no vowel deletion
DEP-IO-MORPH | no (recipient) affix epenthesis
DEP-IO-V | no vowel epenthesis
IDENT-IO-P | no pharyngeal consonant substitution
IDENT-IO-G | no glottal consonant substitution
IDENT-IO-E | no emphatic consonant substitution
IDENT-IO-C | no consonant substitution
IDENT-IO-F | no final vowel substitution
IDENT-IO-V | no vowel substitution
Faithfulness constraints impose input-output correspondence
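One way to make input-output correspondence concrete is to align the donor form with a candidate and count edits: deletions violate MAX-IO, insertions violate DEP-IO, substitutions violate IDENT-IO. The sketch below uses a standard minimum-edit alignment and collapses the finer constraint variants (consonant vs. vowel, affix, final vowel) into these three families. The function name and the one-character-per-phone encoding are assumptions for illustration.

```python
def faithfulness_violations(donor, loan):
    """Count MAX-IO / DEP-IO / IDENT-IO violations along a minimum-edit alignment."""
    n, m = len(donor), len(loan)
    # cell = (edit distance, deletions, insertions, substitutions)
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, i, 0, 0)
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, j, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d, dels, ins, subs = cost[i - 1][j - 1]
            if donor[i - 1] == loan[j - 1]:
                diagonal = (d, dels, ins, subs)          # faithful segment
            else:
                diagonal = (d + 1, dels, ins, subs + 1)  # IDENT-IO
            d, dels, ins, subs = cost[i - 1][j]
            deletion = (d + 1, dels + 1, ins, subs)      # MAX-IO
            d, dels, ins, subs = cost[i][j - 1]
            insertion = (d + 1, dels, ins + 1, subs)     # DEP-IO
            cost[i][j] = min(diagonal, deletion, insertion)
    _, dels, ins, subs = cost[n][m]
    return {"MAX-IO": dels, "DEP-IO": ins, "IDENT-IO": subs}

print(faithfulness_violations("kitaba", "kitabu"))
```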
Optimality Theory constraints
Markedness Constraints
NO-CODA | syllables must not have a coda
ONSET | syllables must have onsets
PEAK | there is only one syllabic peak
SSP | complex onsets rise in sonority
*COMPLEX-S | no consonant clusters on syllable margins
*COMPLEX-C | no consonant clusters within a syllable
*COMPLEX-V | no vowel clusters
Markedness constraints impose output well-formedness
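Several of these markedness constraints can be checked directly on a dot-syllabified form. The sketch below covers only NO-CODA, ONSET, *COMPLEX-C and *COMPLEX-V; PEAK, SSP and *COMPLEX-S are left out because they need a sonority scale that the slides do not specify. The function name, the vowel inventory and the one-character-per-phone encoding are assumptions for illustration.

```python
VOWELS = set("aeiou")  # assumed vowel inventory for this sketch

def markedness_violations(form):
    """Count violations in a dot-syllabified form such as 'ku.tta.ba.'."""
    counts = {"NO-CODA": 0, "ONSET": 0, "*COMPLEX-C": 0, "*COMPLEX-V": 0}
    for syl in filter(None, form.split(".")):
        if syl[-1] not in VOWELS:
            counts["NO-CODA"] += 1   # syllable ends in a consonant
        if syl[0] in VOWELS:
            counts["ONSET"] += 1     # syllable starts with a vowel
        for a, b in zip(syl, syl[1:]):
            if a not in VOWELS and b not in VOWELS:
                counts["*COMPLEX-C"] += 1  # consonant cluster inside the syllable
            elif a in VOWELS and b in VOWELS:
                counts["*COMPLEX-V"] += 1  # vowel cluster inside the syllable
    return counts

print(markedness_violations("ku.tta.ba."))
print(markedness_violations("ki.ta.bu."))
```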
Arabic-Swahili borrowing model: learning candidate ranking
3. Produce a ranked list of candidate loanwords
Candidates: ku.tata.ba.li.ku.tata.ba.ku.tta.ba. ki.ta.bu.ki.ta.bu.
Candidates annotated with the constraints they violate: ku.ta<DEP-V>ta<PEAK>.ba.li<DEP-MORPH>.ku.ta<DEP-V>ta<PEAK>.ba.li.ku.tta<*COMPLEX>.ba.ki.ta.bu<IDENT-IO-V>.ki.ta.bu<DEP-V>.
Arabic-Swahili borrowing model
Arabic words → Donor words to IPA → GEN → EVAL → IPA to Recipient words → Swahili words
GEN: generate plausible Swahili phonetic forms (syllabification, morphological adaptation, phonological adaptation)
EVAL: ranking with Optimality Theory constraints; re-rank loanword candidates to promote input-output correspondence and output well-formedness
Implementation: unweighted insertion/deletion/substitution transducers; weighted identity transducers
Arabic-Swahili borrowing model: learning constraint weights
1. Extract a small training set from Arabic-English and English-Swahili parallel corpora based on phonetic and semantic similarity (cf. Kondrak ‘01, cognate identification)
2. Expand the extracted training set using an Arabic morphological analyzer
3. Learn OT constraint weights using machine learning
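The slides do not say which learning algorithm is used for the constraint weights. As a stand-in, the sketch below scores each candidate by a weighted sum of its violations and trains the weights with a simple perceptron update. The perceptron, the function names (`harmony`, `predict`, `train`), the uniform initial weights and the toy training pair are all assumptions for illustration.

```python
def harmony(weights, marks):
    """Higher is better: each violation costs its constraint's weight."""
    return -sum(weights[c] * n for c, n in marks.items())

def predict(weights, tableau):
    """Return the candidate with the highest harmony (first one wins ties)."""
    return max(tableau, key=lambda cand: harmony(weights, tableau[cand]))

def train(examples, constraints, epochs=5, rate=1.0):
    """Perceptron-style training; examples are (tableau, gold candidate) pairs."""
    weights = {c: 1.0 for c in constraints}  # start from uniform weights
    for _ in range(epochs):
        for tableau, gold in examples:
            guess = predict(weights, tableau)
            if guess == gold:
                continue
            for c in constraints:
                # Raise weights of constraints the wrong guess violates,
                # lower weights of constraints the gold form violates.
                weights[c] += rate * (tableau[guess].get(c, 0) - tableau[gold].get(c, 0))
    return weights

tableau = {
    "ku.tta.ba.": {"*COMPLEX": 1},
    "ki.ta.bu.": {"IDENT-IO-V": 1},
}
examples = [(tableau, "ki.ta.bu.")]
weights = train(examples, ["*COMPLEX", "IDENT-IO-V"])
print(weights, predict(weights, tableau))
```

With uniform weights the two candidates tie; after one update the markedness constraint outweighs the faithfulness constraint and the attested loanword form wins.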
Training: 417 examples
Test: 73 examples (15%), manually verified by a native Arabic speaker and using a Swahili-English dictionary
Arabic-Swahili borrowing model: evaluation
1. Model design
2. Model accuracy
3. Qualitative evaluation: OT constraint ranking is consistent with linguistic accounts
Model design | Dev | Test
Reachability (%) | 75 | 88
Ambiguity (avg. candidates per input word; baseline: 787,000) | 885 | 857
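The two model-design measures can be computed from the generated candidate sets. Ambiguity follows the definition on the slide (average candidates per input word). For reachability the slide gives no definition, so the sketch assumes it is the percentage of donor words whose attested loanword appears among the generated candidates. The function names and the toy data are hypothetical.

```python
def reachability(candidates, gold):
    """Percent of inputs whose gold loanword is among the generated candidates."""
    hits = sum(1 for word, cands in candidates.items() if gold[word] in cands)
    return 100.0 * hits / len(candidates)

def ambiguity(candidates):
    """Average number of generated candidates per input word."""
    return sum(len(c) for c in candidates.values()) / len(candidates)

# Toy data, not taken from the talk's experiments.
candidates = {
    "kitaba": {"kitabu", "kitaba", "kutaba"},
    "wzyr": {"wazir"},
}
gold = {"kitaba": "kitabu", "wzyr": "waziri"}

print(reachability(candidates, gold), ambiguity(candidates))
```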
Accuracy (%)
Orthographic | Levenshtein | 8.9
Orthographic | CRF (transliteration, Ammar et al. ‘12) | 16.4
Phonetic | Levenshtein | 19.8
Phonetic | Levenshtein-H (cognate, Mann & Yarowsky ‘01) | 19.7
OT | uniform constraint weights | 29.3
OT | learned constraint weights | 52.0
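For reference, a Levenshtein baseline of the kind listed here can be sketched as a nearest-neighbour search by edit distance. How the baseline was actually set up is not described in the slides, so the lookup procedure, the function names and the toy vocabulary are assumptions.

```python
def levenshtein(a, b):
    """Standard edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def nearest_loanword(donor, vocabulary):
    """Return the vocabulary word closest to the donor form."""
    return min(vocabulary, key=lambda w: levenshtein(donor, w))

print(nearest_loanword("kitaba", ["safari", "kitabu", "homa"]))
```

Running the same search over IPA strings instead of spellings would give the phonetic variant of the baseline.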
Arabic-Swahili borrowing: research goals
1. Given a Swahili vocabulary and an Arabic vocabulary, identify plausible donor-loanword candidates ✔
2. Produce a ranked list of candidate donor-loanword pairs ✔
3. Augment Swahili-English MT using Arabic-Swahili borrowing model
AR → EN: Arabic-English MT, resource-rich, 5.5M sentences
SW → EN: Swahili-English MT, low-resource, 14K sentences, 5K OOV types (7.5%)
SW word → EN ??? (OOV) → BORROWING MODEL → AR donor → EN TRANSLATION CANDIDATES
safari → ysAfr یسافر → travel
kituruki → trky تركي → turkish
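The pivoting idea on this slide can be sketched as two dictionary lookups: from a Swahili OOV word to its proposed Arabic donors, and from each donor to its English translations. The function name and the two lookup tables are hypothetical stand-ins for the borrowing model and the Arabic-English translation resources; the entries come from the slide's examples.

```python
def translation_candidates(oov_word, donors_of, arabic_to_english):
    """Collect English candidates for a Swahili OOV word via its Arabic donors."""
    candidates = []
    for donor in donors_of.get(oov_word, []):
        for english in arabic_to_english.get(donor, []):
            if english not in candidates:  # keep order, drop repeats
                candidates.append(english)
    return candidates

# Stand-in for the borrowing model's output (Swahili word -> Arabic donors).
donors_of = {"safari": ["ysAfr"], "kituruki": ["trky"]}
# Stand-in for an Arabic-English translation table.
arabic_to_english = {"ysAfr": ["travel"], "trky": ["turkish"]}

print(translation_candidates("safari", donors_of, arabic_to_english))
```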
MT experiments
BLEU
Baseline 18.0
+ OOV loanwords 18.5
Summary of contributions
1. First study on lexical borrowing in NLP
2. First study that operationalizes Optimality Theory in a downstream task
3. Swahili-English MT improvement
Future work
1. More languages
2. More MT experiments
3. Core NLP tasks: cross-lingual part-of-speech tagging
Swahili: shukuru
Arabic: shukran (شكرا)
English: thank you
Arabic-Swahili borrowing statistics*
*a study on 1,460 core words (Schadeberg ‘09)
Loanwords (% within sem. field)
Semantic field | Total | Arabic | English | Other
MODERN WORLD | 73.6 | 15.1 | 43.7 | 14.8
RELIGION | 55.7 | 47.5 | - | 9.2
LAW | 54.6 | 41.1 | 9.4 | 4.1
POSSESSION | 48.1 | 41.4 | 1.9 | 4.9
SOCIO-POLITICAL | 47.5 | 37.9 | - | 9.6
EMOTIONS | 46.8 | 39 | 1.6 | 6.2
COGNITION | 46 | 40.6 | 1.5 | 3.9
CLOTHING | 43.4 | 11.1 | 18.8 | 13.5
THE HOUSE | 37.5 | 19.3 | 6.6 | 11.7
nouns 19%
adjectives 19%
verbs 15%
adverbs 14%
func. words 15%
http://blog.oxforddictionaries.com/2014/08/which-everyday-english-words-came-from-arabic/
Arabic-Swahili borrowing model
ARABIC donor words → Donor words to IPA → GEN → EVAL → IPA to Recipient words → SWAHILI loanwords
Example (book.sg.indef): كتابا → kitabu
Donor words to IPA: kuttaba, kitaba, ...
GEN
Donor affix removal: kuttaba, kuttab, kitaba, kitab, ...
Syllabification: ku.tta.ba., ku.t.ta.ba., ..., ki.ta.ba., ki.ta.b., ...
Donor-to-Recipient phonological adaptation: ku.ta.ba. [degemination]; ku.tata.ba. [epenthesis]; ku.ta.bu. [final vowel subst.]; ki.ta.bu. [final vowel subst.]; ki.ta.bu. [epenthesis]; ...
Recipient morphological adaptation: ku.tata.ba.li.ku.tata.ba.vi.ki.ta.bu. ki.ta.bu.ki.ta.bu. ...
EVAL
Ranking with Optimality Theory constraints: ku.ta<DEP-V>ta<PEAK>.ba.li<DEP-MORPH>.ku.ta<DEP-V>ta<PEAK>.ba.li.ku.tta<*COMPLEX>.ba.ki.ta.bu<IDENT-IO-V>.ki.ta.bu<DEP-V>.vi<DEP-MORPH>.ki.ta.bu<IDENT-IO-V>.
IPA to Recipient words: kitabu
Arabic-Swahili morphophonological adaptation
● Syllable structure: CV, CVV, CVC, CVCC → V, CV
● Morphology
○ Arabic affix deletion (optional)
○ Swahili affix concatenation
● Phonology
○ Vowel deletion: shortening of Arabic long vowels and vowel clusters
○ Consonant degemination: shortening of Arabic geminate consonants
○ Substitution of similar phones: /tˤ/→/t/, /dˤ/→/d/, /θ/→/s/, /x/→/k/, etc.
○ Vowel epenthesis: eliminating Arabic codas and consonant clusters
○ Final vowel substitution: /u/, /o/, /i/, /e/
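Two of the phonological rules listed above, substitution of similar phones and shortening of long vowels, can be sketched over a list of phone symbols. Only the four substitutions named on the slide are included; the slide's "etc." is left unfilled. The function name, the list-of-phones representation and the use of the IPA length mark "ː" for long vowels are assumptions for illustration.

```python
# Substitutions named on the slide; the full inventory is not given there.
SUBSTITUTIONS = {"tˤ": "t", "dˤ": "d", "θ": "s", "x": "k"}

def adapt_phones(phones):
    """Map Arabic phones to similar Swahili phones and shorten long vowels."""
    adapted = []
    for phone in phones:
        phone = SUBSTITUTIONS.get(phone, phone)  # substitution of similar phones
        adapted.append(phone.rstrip("ː"))        # drop the length mark, if any
    return adapted

print(adapt_phones(["x", "aː", "tˤ"]))
```

Working on a list of phones rather than a raw string keeps multi-character symbols such as /tˤ/ intact.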