Language Change as a Constrained Multi-Objective Optimization
A tale of the lazy tongue
Monojit Choudhury, Microsoft Research Lab, India, monojitc@microsoft.com
Indo-Australia Workshop on Optimization in Human Language Technology, 16th Dec 2012, IIT Patna



Language Change

• Change in the syntactic/semantic/phonological features of a language

• Perpetual, universal, directional (?)

• Phonological Change:
  – Affects the sounds
  – Structured, independent of syntax/semantics
  – Example: Loss of consonant clusters in Hindi: agni → aag, dugdha → dUdh, raatri → raat


Effects of the “Lazy Tongue”

Assimilation
• in + apt = inapt
• in + decent = indecent
• in + polite = impolite
• in + mature = immature
• in + legal = illegal
• in + regular = irregular

Deletion
• cannot → can’t
• do not → don’t
• will not → won’t
• are not → ain’t
• information → info


Explanations for Change

Exogenous causes
– Language contact
– Socio-political factors
– Communication medium

Endogenous causes
– Functional
– Phonetic error-based
– Frequency drifts
– Evolutionary


Functional Explanation of Language Change

• There are three evolutionary forces on any linguistic system:
  – Minimization of effort (energy)
  – Maximization of perceptual distinctiveness (minimization of ambiguity)
  – Maximization of learnability

Language is a perpetually evolving system shaped by these three conflicting forces.


Outline of the Talk

• Morpho-phonological change of Bangla Verb systems and emergence of dialect diversity
  – Approach: Multi-Objective Constrained Optimization
  – Technique: Multi-Objective Genetic Algorithm (MOGA)

• Understanding Computer Mediated Communication
  – Normalization of Texting language
  – Romanization of Indian Language text


Geography of Bangla

• Standard Colloquial Bengali (SCB)

• Agartala Colloquial Bengali (ACB)

• Sylhetti


History of Bangla

[Figure: timeline from 1200 AD to 1800 AD]


Bangla Verb Morphology

করেছিলাম (kar-echh-il-aam): I had done

• kar: Verb root (do)
• echh: Aspect (perfect)
• il: Tense (past)
• aam: Person (first)


Cognates in the Dialects

Features      Classical      SCB         ACB
Non-finite    kariyA         kore        kairA
Ps,2, per.    kariyAChila    koreChilo   korsilo
Ps,1, cont.   kariteChilAm   korChilAm   kartAslAm

root: kar (to do)


Atomic Phonological Operators

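[Figure: derivation graph starting from kariteChila, with edges labelled by atomic phonological operators]

Forms shown: kariteChila, karitChila, kariChila, kairChila, korChila, korChilo

Operators shown: Del(e/t_Ch), Del(t/_Ch), Met(ri/_Ch), Asm(a→o/_i), Mut(a→o/_$)

Operator types: Deletion, Metathesis, Assimilation, Mutation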
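The context-sensitive operators on this slide can be read as small string rewrite rules. The following is a minimal Python sketch, assuming that Op(x/L_R) means "apply Op to x when it is preceded by L and followed by R" and that verb forms are plain romanized strings. The function names are illustrative and do not come from the talk.

```python
def delete(word, target, left="", right=""):
    """Del(target/left_right): drop target when it sits between the contexts."""
    return word.replace(left + target + right, left + right)


def metathesize(word, pair, left="", right=""):
    """Met(pair/left_right): swap the two symbols of pair in the given context."""
    return word.replace(left + pair + right, left + pair[::-1] + right)


def mutate_final(word, old, new):
    """Mut(old -> new/_$): rewrite old as new only at the end of the word."""
    return word[: -len(old)] + new if old and word.endswith(old) else word


form = "kariteChila"
form = delete(form, "e", left="t", right="Ch")   # Del(e/t_Ch)
form = delete(form, "t", right="Ch")             # Del(t/_Ch)
form = metathesize(form, "ri", right="Ch")       # Met(ri/_Ch)
print(form)                                      # kairChila
print(mutate_final("korChila", "a", "o"))        # korChilo
```

Plain substring replacement is enough for these examples; a fuller treatment would need to handle multi-character symbols such as Ch as single units.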


Hypothesis

A sequence of Atomic Phonological Operators is preferred if the verb forms obtained by applying it to the classical forms have some functional benefit over the classical forms.

Thus, all the modern dialects of Bangla have some functional advantage over the classical dialect.


A Formal Model of Functional Explanation

[Figure: objective space with axes f1 (effort of articulation) and f2 ([acoustic distinctiveness]^-1), with regions marked as unstable languages, impossible languages, and metastable languages]


Genetic Algorithm

• Gene: a string of symbols
• What the solution actually looks like

GA: search for good solutions by mimicking nature [recombination and mutation of genes]
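Recombination and mutation on a gene that is a string of symbols can be sketched as follows. This is a generic illustration, assuming one-point crossover and per-symbol mutation; the slides do not say which variants were used, and the function names and toy genes are hypothetical.

```python
import random


def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length genes at a random cut point."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]


def point_mutation(gene, alphabet, rate, rng):
    """Replace each symbol by a random one from alphabet with probability rate."""
    return [rng.choice(alphabet) if rng.random() < rate else s for s in gene]


rng = random.Random(0)          # fixed seed so the run is repeatable
a = list("AAAAAA")              # toy parent genes
b = list("BBBBBB")
child_1, child_2 = one_point_crossover(a, b, rng)
mutant = point_mutation(child_1, alphabet="AB", rate=0.1, rng=rng)
print("".join(child_1), "".join(child_2), "".join(mutant))
```

Passing the random generator explicitly keeps the operators deterministic under a fixed seed, which helps when comparing runs.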


Phenotype

Lexicon consisting of 28 forms for the verb kar

Example lexicon 1: kori, korChi, ..., korte

Example lexicon 2: kori, kartAsi, ..., kartA


Genotype

A sequence of atomic phonological operators

Del t | Met ri | NOP | Del e | Asm a | Del i | NOP

Dsm e | NOP | NOP | Met ri | Asm a | Del e | NOP


Genotype → Phenotype

Gene: Del t | Met ri | NOP | Del e | Asm a | Del i | NOP

Classical forms: kari, kariteChi, karite
After Del t: kari, karieChi, karie
After Met ri: kair, kaireChi, kaire
Final forms: kor, korCh, kor
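The mapping from genotype to phenotype on this slide can be sketched as applying each operator in the gene to every form in the lexicon, in order. The sketch below assumes the context-free reading the slide's example suggests: Del x removes every x, Met ri swaps r and i, and Asm a rewrites a as o before i. These semantics are inferred from the worked example, not stated in the talk, and the function names are hypothetical.

```python
def apply_operator(op, word):
    """Apply one context-free operator such as 'Del t' to a single form."""
    kind, _, arg = op.partition(" ")
    if kind == "NOP":
        return word
    if kind == "Del":
        return word.replace(arg, "")
    if kind == "Met":
        return word.replace(arg, arg[::-1])
    if kind == "Asm" and arg == "a":
        # Assumption: 'Asm a' rewrites a as o before i, as the example suggests.
        return word.replace("ai", "oi")
    raise ValueError(f"unsupported operator: {op}")


def decode(gene, lexicon):
    """Map a genotype (operator sequence) to a phenotype (list of forms)."""
    for op in gene:
        lexicon = [apply_operator(op, w) for w in lexicon]
    return lexicon


gene = ["Del t", "Met ri", "NOP", "Del e", "Asm a", "Del i", "NOP"]
print(decode(gene, ["kari", "kariteChi", "karite"]))  # ['kor', 'korCh', 'kor']
```

Under these assumptions the decoded forms match the final forms shown on the slide.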


Crossover


Mutation


Multi-Objective GA


Multi-Objective GA: Apply constraints



Multi-Objective GA: Finding out good solutions


Multi-Objective GA: But also keep some not-so-good solutions



Multi-Objective GA: After several iterations
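The "good solutions" in a multi-objective search are usually taken to be the non-dominated ones. Below is a minimal sketch of that dominance filter, assuming every objective is minimized. It is not NSGA-II itself, which additionally ranks successive fronts and uses a crowding measure to preserve diversity; the scores are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (effort, 1/distinctiveness) scores for four candidate lexicons.
scores = [(1.0, 5.0), (2.0, 2.0), (3.0, 3.0), (5.0, 1.0)]
print(pareto_front(scores))  # [(1.0, 5.0), (2.0, 2.0), (5.0, 1.0)]
```

The point (3.0, 3.0) drops out because (2.0, 2.0) is better on both objectives; the remaining three are mutually incomparable trade-offs.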


Objective functions

• Articulatory effort
  – fe(Λ): weighted sum of number of syllables, letters and vowel height differences, averaged over all words in the lexicon

• Acoustic Distinctiveness
  – fd(Λ): inverse of mean edit distance between words

• Learnability
  – fr(Λ): correlation between feature match and edit distance


Experiments

• NSGA-II: a package for fast MOGA
• Gene length: 15 APOs
• A repertoire of 128 APOs
• Population: 1000, Generations: 500
• 6 models with different combinations of constraints and objectives


Pareto-optimal front

[Figure: Pareto-optimal front, with points marked for CB, Sylhetti, ACB, and SCB]


Observations

• Vertical and horizontal limb
• Real dialects on the horizontal limb
• Sound changes push the dialects from right to left (reduce effort)
• But never up the limb
• Why?


Role of Constraints


For more information

Choudhury et al., Evolution, optimization, and language change: the case of Bengali verb inflections, in Proceedings of ACL SIGMORPHON9, Association for Computational Linguistics, 2007

http://research.microsoft.com/people/monojitc/

MOGA and NSGA-II: Kanpur Genetic Algorithms Laboratory

http://www.iitk.ac.in/kangal/index.shtml


Food for Thought

• Evaluation:
  – Myriads of possible dialects, but only a few observed in nature

• Fixed set of pre-defined APOs – how to generalize for any change?

• MOGA is an optimization tool, which in no way simulates language change
  – How do languages optimize themselves?


Outline of the Talk

• Morpho-phonological change of Bangla Verb systems and emergence of dialect diversity
  – Approach: Multi-Objective Constrained Optimization
  – Technique: Multi-Objective Genetic Algorithm (MOGA)

• Understanding Computer Mediated Communication
  – Normalization of Texting language
  – Romanization of Indian Language text


Computer Mediated Communication

Form


Texting Language

• A new genre of English & also other languages used in chats, sms, emails, blogs, tweets, FB posts, comments, etc.

• Ungrammatical, unconventional spellings

dis is n eg 4 txtin lang (24 characters)

This is an example for Texting language (39 characters)

The shorter, the faster
Constraint: understandability


Analysis of Social Media

• A hot topic in NLP
  – Normalization
  – Language identification
  – Sentiment/Polarity detection
  – Summarization/trend prediction

Choudhury et al. (2007) Investigation and Modeling of the Structure of Texting Language. In IJCAI Workshop on Analytics of Noisy Data 2007


Tomorrow never dies!!!

• 2moro (9)
• tomoz (25)
• tomoro (12)
• tomrw (5)
• tom (2)
• tomra (2)
• tomorrow (24)
• tomora (4)
• tomm (1)
• tomo (3)
• tomorow (3)
• 2mro (2)
• morrow (1)
• tomor (2)
• tmorro (1)
• moro (1)


Patterns or Compression Operators

• Phonetic substitution (phoneme)
  – psycho → syco, then → den

• Phonetic substitution (syllable)
  – today → 2day, see → c

• Deletion of vowels
  – message → mssg, about → abt

• Deletion of repeated characters
  – tomorrow → tomorow


• Truncation (deletion of tails)
  – introduction → intro, evaluation → eval

• Common Abbreviations
  – Bangalore → blr, text back → tb

• Informal pronunciation
  – going to → gonna, better → betta
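Three of the purely orthographic compression operators listed above can be sketched as small string functions. This is an illustrative sketch, not the model from the talk: it assumes that vowel deletion keeps a word-initial vowel (as in about → abt) and that truncation keeps a caller-chosen number of characters. The function names are hypothetical.

```python
import re

VOWELS = "aeiou"


def delete_vowels(word):
    """Drop every vowel except a word-initial one (about -> abt)."""
    return word[:1] + "".join(c for c in word[1:] if c not in VOWELS)


def collapse_repeats(word):
    """Reduce each run of a repeated character to one (tomorrow -> tomorow)."""
    return re.sub(r"(.)\1+", r"\1", word)


def truncate(word, keep):
    """Keep only the first keep characters (introduction -> intro)."""
    return word[:keep]


print(delete_vowels("message"), delete_vowels("about"))         # mssg abt
print(collapse_repeats("tomorrow"))                             # tomorow
print(truncate("introduction", 5), truncate("evaluation", 4))   # intro eval
```

The phonetic substitutions and abbreviations on these slides depend on pronunciation or convention and are not covered by rules this simple.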


HMMs for SMS Normalization

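[Figure: HMM for the word "today", with states S0, S1 ("2"), and S6; grapheme states G1 'T', G2 'O', G3 'D', G4 'A', G5 'Y'; phoneme states P2 /AH/ and P4 /AY/; emission labels ε T @, ε O @, ε D @, ε A @, ε Y @]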


Bigram Examples

• TL: would b gd 2 c u some time soon
• Op: would be good to see you some time soon

• TL: just wanted 2 say a big thanx 4 my bday card
• Op: just wanted to say a big thanks for my today card

• TL: me wel i fink bein at home makes me feel a lot more stressed den bein away from it
• Op: me well i think being at home makes me feel a lot more stressed deny being away from it


Use of Indian Languages on Online Social Media

• Code mixing
• Transliteration
• Spelling Change
• Indian English


Concluding Remarks

• Languages are perpetually evolving and optimizing systems
  – Computational modeling of language change is still in its infancy
  – Lots of scope for research


Thank you

monojitc@microsoft.com

Questions??


Why Computational Models?

FOR                       AGAINST
Formalization             Intractable
Virtual experimentation   Simplified assumptions
Exploration               Toy languages

Can we model real-world language change?


Objectives and Constraints - 1

• Articulatory effort

$f_e(w) = \alpha_1 f_{e1}(w) + \alpha_2 f_{e2}(w) + \alpha_3 f_{e3}(w)$

$f_{e1}(w) = |w|$

$f_{e2}(w) = \sum_i hr(\sigma_i)$

$f_{e3}(w) = \sum_i |ht(V_i) - ht(V_{i+1})|$


Objectives and Constraints - 2

• Acoustic distinctiveness

$f_d(\Lambda) = \frac{1}{N} \sum_{i \neq j} ed(w_i, w_j)^{-1}$

$C_d(\Lambda) = -1$ if $ed(w_i, w_j) = 0$ for more than 2 pairs

• Phonotactic constraints

$C_p(\Lambda) = -1$ if any of the words violates the phonotactic constraints of the language
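The distinctiveness objective and its constraint can be sketched in a few lines of Python. Several choices here are assumptions, since the slide leaves them open: N is taken to be the number of unordered word pairs, identical pairs (edit distance 0) are skipped in the sum and left to the constraint, a satisfied constraint returns 0, and edit distance is computed over individual characters. Function names are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def distinctiveness_cost(lexicon):
    """f_d: mean inverse edit distance over word pairs (lower = more distinct)."""
    distances = [edit_distance(a, b) for a, b in combinations(lexicon, 2)]
    # Assumption: identical pairs are handled by the constraint and skipped here.
    return sum(1 / d for d in distances if d > 0) / len(distances)


def distinctiveness_constraint(lexicon, max_identical_pairs=2):
    """C_d: -1 if more than two pairs are identical, else 0 (0 is an assumption)."""
    identical = sum(a == b for a, b in combinations(lexicon, 2))
    return -1 if identical > max_identical_pairs else 0


toy_lexicon = ["kori", "koro", "korle"]  # hypothetical forms
print(distinctiveness_cost(toy_lexicon), distinctiveness_constraint(toy_lexicon))
```

Because edit distance is symmetric, summing over unordered pairs and dividing by their count gives the same mean as summing over ordered pairs.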


Objectives and Constraints - 3

• Learnability as Regularity
  – fr: the correlation coefficient between the edit distance and the number of matching morphological attributes for every word pair
  – Cr = -1 if fr > 0.8
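The regularity measure can be sketched as a Pearson correlation over word pairs. The sketch assumes each pair has already been reduced to (edit distance, number of matching attributes); the pair values below are hypothetical, and returning 0 for a satisfied constraint is an assumption, as the slide only gives the violated value. Function names are illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def regularity(pairs):
    """f_r for a list of (edit_distance, matching_attributes) word pairs."""
    distances, matches = zip(*pairs)
    return pearson(distances, matches)


def regularity_constraint(f_r, threshold=0.8):
    """C_r: -1 if f_r exceeds the threshold, else 0 (0 is an assumption)."""
    return -1 if f_r > threshold else 0


# Hypothetical pairs: forms sharing more attributes tend to be closer in spelling.
pairs = [(1, 1), (3, 1), (2, 0), (3, 0), (2, 1), (2, 1)]
f_r = regularity(pairs)
print(f_r, regularity_constraint(f_r))
```

For this toy data the correlation is negative, since pairs that share attributes have smaller edit distances.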


Emergent dialects

Classical      D1        D2                  D3
kariteChilAm   kartA     karChi (korChi)     karteChi (kartAsi)
kariteChila    kartAa    karCha (korCha)     karteCha (kartAsa)
kariteChilen   kartAen   karChen (korChen)   karteChen (kartAsen)