Upload
vanmien
View
223
Download
0
Embed Size (px)
Citation preview
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Towards a Computational Model
of Grammaticalization and
Lexical Diversity
Christian Bentz & Paula [email protected]
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Outline
Background
• Lexical diversity
• Grammaticalization
Computational Model
• Architecture
• Outcome
Future Directions• Model improvement
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
2/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Lexical diversity
Definition
Definition: The distribution of word forms used to encode aconstant information content
Parallel texts:- Universal Decalaration of Human Rights (∼ 400 languages)- Parallel Bible Corpus (∼ 1000 languages)- Europarl (21 languages)
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
3/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Lexical diversity
Definition
Definition: The distribution of word forms used to encode aconstant information content
Parallel texts:- Universal Decalaration of Human Rights (∼ 400 languages)- Parallel Bible Corpus (∼ 1000 languages)- Europarl (21 languages)
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
3/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Lexical diversity
Driving factors
• Morphological markingEnglish: the shipGerman: das Schiff, dem Schiff(e), des Schiffes
• CompoundingEnglish: key to the cabin of the captain of the shipGerman: Schifffahrtskapitaenkabinenschluessel
• LexiconEnglish: closeGerman: zuschliessen, abschliessen
• Orthography
• etc.
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
4/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Lexical diversity
Driving factors
• Morphological markingEnglish: the shipGerman: das Schiff, dem Schiff(e), des Schiffes
• CompoundingEnglish: key to the cabin of the captain of the shipGerman: Schifffahrtskapitaenkabinenschluessel
• LexiconEnglish: closeGerman: zuschliessen, abschliessen
• Orthography
• etc.
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
4/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Lexical diversity
Driving factors
• Morphological markingEnglish: the shipGerman: das Schiff, dem Schiff(e), des Schiffes
• CompoundingEnglish: key to the cabin of the captain of the shipGerman: Schifffahrtskapitaenkabinenschluessel
• LexiconEnglish: closeGerman: zuschliessen, abschliessen
• Orthography
• etc.
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
4/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Lexical diversity
Driving factors
• Morphological markingEnglish: the shipGerman: das Schiff, dem Schiff(e), des Schiffes
• CompoundingEnglish: key to the cabin of the captain of the shipGerman: Schifffahrtskapitaenkabinenschluessel
• LexiconEnglish: closeGerman: zuschliessen, abschliessen
• Orthography
• etc.
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
4/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Lexical diversity
Quantitative measure
Zipf-Mandelbrots law: Order types (word forms delimited bywhite spaces) according to their token frequencies (Zipf, 1949;Mandelbrot, 1953)
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
5/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Lexical diversity
Quantitative measure
Zipf-Mandelbrots law: Order types (word forms delimited bywhite spaces) according to their token frequencies (Zipf, 1949;Mandelbrot, 1953)
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
5/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Lexical diversity
Zipf-Mandelbrot’slaw
f (ri ) =C
β + rαi,
C > 0,α > 0,β > −1,i = 1, 2, . . . , n
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
6/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Lexical diversity
Zipf-Mandelbrot’slaw
f (ri ) =C
β + rαi,
C > 0,α > 0,β > −1,i = 1, 2, . . . , n
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
6/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Diachrony
Bentz, Kiela, Hill & Buttery (2014) Zipf’s law and the grammar oflanguages. Corpus Linguistics and Linguistic Theory.
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
7/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Diachrony
Bentz, Kiela, Hill & Buttery (2014) Zipf’s law and the grammar oflanguages. Corpus Linguistics and Linguistic Theory.
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
7/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Synchrony
Bentz, Kiela, Hill & Buttery (in preparation) Adaptive languages:Modeling lexical diversity cross-linguistically.
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
8/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Synchrony
Bentz, Kiela, Hill & Buttery (in preparation) Adaptive languages:Modeling lexical diversity cross-linguistically.
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
8/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Lexical diversity
• Seems to be reduced in contact scenarios (non-nativelanguage learning, see Trudgill, 2011; Lupyan & Dale 2011;McWhorter, 2007)
Question
WHY DO LANGUAGES GET HIGH LEXICAL DIVERSITIES INTHE FIRST PLACE?
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
9/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Lexical diversity
• Seems to be reduced in contact scenarios (non-nativelanguage learning, see Trudgill, 2011; Lupyan & Dale 2011;McWhorter, 2007)
Question
WHY DO LANGUAGES GET HIGH LEXICAL DIVERSITIES INTHE FIRST PLACE?
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
9/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Grammaticalization
Definition
In the final stage of grammaticalization frequently co-occurringwords merge by means of phonological fusion (Bybee, 2003: 617)and hence ’morphologize’ to built inflections and derivations
Cline
content item >grammatical word >clitic >inflectional affix(Hopper and Traugott, 2003: 7)
Example
Old English lıc ’body’ → -lyLatin cantare habeo ’I have to sing’ → Italian cantero
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
10/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Grammaticalization
Definition
In the final stage of grammaticalization frequently co-occurringwords merge by means of phonological fusion (Bybee, 2003: 617)and hence ’morphologize’ to built inflections and derivations
Cline
content item >grammatical word >clitic >inflectional affix(Hopper and Traugott, 2003: 7)
Example
Old English lıc ’body’ → -lyLatin cantare habeo ’I have to sing’ → Italian cantero
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
10/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Grammaticalization
Definition
In the final stage of grammaticalization frequently co-occurringwords merge by means of phonological fusion (Bybee, 2003: 617)and hence ’morphologize’ to built inflections and derivations
Cline
content item >grammatical word >clitic >inflectional affix(Hopper and Traugott, 2003: 7)
Example
Old English lıc ’body’ → -lyLatin cantare habeo ’I have to sing’ → Italian cantero
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
10/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Grammaticalization
Hypothesis
Grammaticalization → increasing lexical diversityDeflexion → decreasing lexical diversity
Question
Can we computationally model the impact of grammaticalizationon lexical diversity?
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
11/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Grammaticalization
Hypothesis
Grammaticalization → increasing lexical diversityDeflexion → decreasing lexical diversity
Question
Can we computationally model the impact of grammaticalizationon lexical diversity?
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
11/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Computational Model
Starting point: Fijian UDHR
• parallel text, control for constant information content
• analytic language with low lexical diversity
Process
• merge a given percentage (pm) of frequently co-occurringwords over several generations (nG )
Endpoint
• Do we arrive at lexical diversities similar to the ones forGerman or Hungarian?
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
12/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Computational Model
Starting point: Fijian UDHR
• parallel text, control for constant information content
• analytic language with low lexical diversity
Process
• merge a given percentage (pm) of frequently co-occurringwords over several generations (nG )
Endpoint
• Do we arrive at lexical diversities similar to the ones forGerman or Hungarian?
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
12/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Computational Model
Starting point: Fijian UDHR
• parallel text, control for constant information content
• analytic language with low lexical diversity
Process
• merge a given percentage (pm) of frequently co-occurringwords over several generations (nG )
Endpoint
• Do we arrive at lexical diversities similar to the ones forGerman or Hungarian?
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
12/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Architecture
Input text
pv : % replace rR : range of rankspm: % merge
Output text
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
13/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Outputpm = 2.5, pv = 0; rR = 0; nG = 10
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
14/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Outputpm = 2.5, pv = 0; rR = 0; nG = 10
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
14/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Output
Words created in English
• ofthe → genitive marked article, German: des
• inthe → preposition merged with article, Italian: in + ilrendering nel
• topromote → preposition + verb, German: zusehen,zuschliessen
• ofsociety → preposition + noun (case prefix?)
• humanrights, humanbeing → compounding, German:Menschenrechte
• everyonehastherighttofreedomof, withoutanydiscrimination
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
15/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Output
Words created in English
• ofthe → genitive marked article, German: des
• inthe → preposition merged with article, Italian: in + ilrendering nel
• topromote → preposition + verb, German: zusehen,zuschliessen
• ofsociety → preposition + noun (case prefix?)
• humanrights, humanbeing → compounding, German:Menschenrechte
• everyonehastherighttofreedomof, withoutanydiscrimination
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
15/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Output
Words created in English
• ofthe → genitive marked article, German: des
• inthe → preposition merged with article, Italian: in + ilrendering nel
• topromote → preposition + verb, German: zusehen,zuschliessen
• ofsociety → preposition + noun (case prefix?)
• humanrights, humanbeing → compounding, German:Menschenrechte
• everyonehastherighttofreedomof, withoutanydiscrimination
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
15/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Output
Words created in English
• ofthe → genitive marked article, German: des
• inthe → preposition merged with article, Italian: in + ilrendering nel
• topromote → preposition + verb, German: zusehen,zuschliessen
• ofsociety → preposition + noun (case prefix?)
• humanrights, humanbeing → compounding, German:Menschenrechte
• everyonehastherighttofreedomof, withoutanydiscrimination
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
15/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Output
Words created in English
• ofthe → genitive marked article, German: des
• inthe → preposition merged with article, Italian: in + ilrendering nel
• topromote → preposition + verb, German: zusehen,zuschliessen
• ofsociety → preposition + noun (case prefix?)
• humanrights, humanbeing → compounding, German:Menschenrechte
• everyonehastherighttofreedomof, withoutanydiscrimination
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
15/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Output
Words created in English
• ofthe → genitive marked article, German: des
• inthe → preposition merged with article, Italian: in + ilrendering nel
• topromote → preposition + verb, German: zusehen,zuschliessen
• ofsociety → preposition + noun (case prefix?)
• humanrights, humanbeing → compounding, German:Menschenrechte
• everyonehastherighttofreedomof, withoutanydiscrimination
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
15/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Future Directions
Model improvement
• Exploring models with varying parameters for vocabularyreplacement and merging of bigrams (comparison to actuallanguage change data)
• More realistic model by parsing and POS tagging
• Considering frequency measures beyond bigram frequencies
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
16/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Future Directions
Model improvement
• Exploring models with varying parameters for vocabularyreplacement and merging of bigrams (comparison to actuallanguage change data)
• More realistic model by parsing and POS tagging
• Considering frequency measures beyond bigram frequencies
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
16/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Future Directions
Model improvement
• Exploring models with varying parameters for vocabularyreplacement and merging of bigrams (comparison to actuallanguage change data)
• More realistic model by parsing and POS tagging
• Considering frequency measures beyond bigram frequencies
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
16/18
dept . o f theoret i ca l and app l i ed l i ngu i s t i c s
Collaborators
Douwe Kiela Felix Hill Andrew Caines
Christian Bentz & Paula Buttery [email protected] — Grammaticalization
17/18