Testing Functional Explanations of Word Order Universals
Michael Hahn (Stanford)   Richard Futrell (UC Irvine)
(Greenberg 1963)
U3: ‘Languages with dominant VSO order are always prepositional.’
U4: ‘With overwhelmingly greater than chance frequency, languages with normal SOV order are postpositional.’
‘Relative position of adposition & noun ~ relative position of verb & object’
OV languages with postpositions
VO languages with prepositions
Why do these universals hold?
Innate constraints on language, ‘Universal Grammar’? (Chomsky 1981)
Facilitation of human communication? (Dryer 1992, Hawkins 1994)
Make languages learnable? (Culbertson 2017)
Approach: Test functional explanations by implementing efficiency measures, optimizing grammars, and checking whether universals hold in optimized grammars.
Three Efficiency Measures
Dependency Length Minimization (Rijkhoff, 1986; Hawkins, 1994, 2003)
Surprisal (Gildea and Jaeger, 2015; Ferrer-i Cancho, 2017)
Parsability (Hawkins, 1994, 2003)
Three Efficiency Measures: Dependency Length Minimization (Rijkhoff, 1986; Hawkins, 1994, 2003)
Example: arcs of length 2, 1, and 1 give a total dependency length of 2 + 1 + 1 = 4.
Three Efficiency Measures: Surprisal
Surprisal(w1...wn) = −Σi log P(wi | w1...wi−1)
Estimated using recurrent neural networks, the strongest existing methods for estimating surprisal and predicting reading times.
Three Efficiency Measures: Parsability
Mary has two green books.
Parsability(utterance) := log P(tree | utterance)
Estimated using a neural network parser (Dozat and Manning 2017) with an extremely generic architecture.
Combining Parsability + Surprisal

Utility = Informativity − λ · Cost

Informativity: the amount of meaning that can be extracted from the utterance ~ Parsability
Cost: the cost of processing the utterance ~ Surprisal

Long tradition as an explanation of language (Gabelentz 1903, Zipf 1949, Horn 1984, …)
Formalized in Rational Speech Acts models (Frank and Goodman 2012)
Related to signal processing (Rate-Distortion Theory, Information Bottleneck)
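As a toy illustration (our own, not part of the talk's implementation; the function name and λ value are made up), the per-utterance trade-off can be written as:

```python
# Hypothetical sketch: combining parsability and surprisal into communicative utility.
# `parsability` and `surprisal` would come from the neural parser and language model.

def utility(parsability: float, surprisal: float, lam: float = 1.0) -> float:
    """Utility = Informativity - lambda * Cost, with
    Informativity ~ parsability = log P(tree | utterance) and
    Cost ~ surprisal = -sum_i log P(w_i | w_1...w_{i-1})."""
    return parsability - lam * surprisal

print(utility(parsability=-1.8, surprisal=5.8, lam=0.5))  # -4.7
```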
Testing Functional Explanations
Approach: Optimize the word orders of languages for the three objectives, keeping syntactic structures unchanged
Languages have word order regularities ⇒ Not sufficient to optimize the word orders of individual sentences
Instead: optimize word order rules of entire languages
That is: optimized languages have optimized yet internally consistent grammatical regularities in word order, and agree with an actual natural language in all other respects.
Dependency Corpus: ‘Mary has two green books’, annotated with the dependencies nsubj(has → Mary), dobj(has → books), nummod(books → two), amod(books → green).

Tree Topologies: the unordered dependency trees extracted from the corpus.

Ordering Grammar (example parameter values per dependency type):
  NOUN ADJ (amod): 0.3
  NOUN NUM (nummod): 0.7
  VERB NOUN (nsubj): −0.2
  VERB NOUN (dobj): 0.8
  ...
The parameters encode ordering regularities such as ‘Object follows verb’, ‘Adjective precedes noun’, and ‘Numerals follow adjectives & precede nouns’.
Counterfactual Corpus: linearizing the tree topologies with the ordering grammar yields a counterfactual corpus, i.e. a re-ordered version of the original sentences such as ‘Mary has two green books’.

Each parameter setting generates a different counterfactual corpus.
We compute the processing measures on the counterfactual corpora. Each parameter setting results in different values for the processing measures, for example:
  Dependency Length 2.3, Surprisal 5.8, Parsability 1.8
  Dependency Length 2.9, Surprisal 4.5, Parsability 2.9
  Dependency Length 3.4, Surprisal 7.8, Parsability 1.2

Which settings optimise the measures?
Do the optimised settings replicate the Greenberg correlations?
For each objective, find parameters that optimise it:
  Minimize Dependency Length
  Minimize Surprisal
  Maximize Parsability
  Optimize Parsability + Surprisal

Repeat this for corpora from 51 real languages from the Universal Dependencies project.
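A minimal sketch of the general optimization recipe (our own illustration; the actual procedure may use gradient-based methods, and `linearize`, `mean_dependency_length`, and the tree representation are assumptions):

```python
# Hypothetical sketch: hill-climbing over ordering-grammar parameters for one objective.
import random

def mean_dependency_length(corpus_orders):
    """Objective example: average summed distance between each word and its head.
    Each `order` is a list of (position, head_position) pairs; None marks the root."""
    total = sum(sum(abs(i - h) for i, h in order if h is not None) for order in corpus_orders)
    return total / max(len(corpus_orders), 1)

def optimize(grammar, trees, linearize, objective, steps=1000, noise=0.1):
    """Perturb parameters, keep changes that improve (lower) the objective."""
    best = dict(grammar)
    best_score = objective([linearize(t, best) for t in trees])
    for _ in range(steps):
        cand = {rel: v + random.gauss(0, noise) for rel, v in best.items()}
        score = objective([linearize(t, cand) for t in trees])
        if score < best_score:
            best, best_score = cand, score
    return best, best_score
```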
1. How do the objectives compare?
2. Which universals are predicted?
Surprisal and Parsability minimize Dependency Length
Communicative Utility predicts Dependency Length Minimization.
Language optimizes Surprisal and Parsability

[Figure: Lower Surprisal plotted against Better Parsability, z-transformed on the level of languages, comparing Random Grammars, Grammars fit to Real Orderings, and grammars Optimized for Surprisal, for Parsability, and for Parsability+Surprisal.]
(Dryer 1992, Language)
‘Relative position of adposition & noun ~ relative position of verb & object’
We formalize the correlations in the Universal Dependencies format.
For any word order grammar, we can then check which correlations it satisfies.
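As an illustration of this last step (our own sketch, not the talk's formalization), a single correlation can be read off a grammar's parameters; here `grammar` is assumed to map a UD relation to the probability that the dependent follows its head:

```python
# Hypothetical check of the adposition~verb-object correlation on one grammar.
# In UD, the adposition attaches to the noun via the `case` relation.

def satisfies_adposition_correlation(grammar: dict) -> bool:
    prepositional = grammar["case"] < 0.5      # adposition tends to precede the noun
    verb_object = grammar["obj"] > 0.5         # object tends to follow the verb
    return prepositional == verb_object        # prepositions with VO, postpositions with OV

print(satisfies_adposition_correlation({"case": 0.2, "obj": 0.9}))  # True: prepositional & VO
```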
Are the universals satisfied by grammars fit to the actual orderings of our 51 languages?
[Figure: percentage of fitted grammars satisfying each universal; annotated exceptions: prevalence of SVO (Dryer 1992), limitation of the formalisation.]

Percentage of grammars optimized for each objective satisfying each universal:
[Figure]
Assessing Significance:
X = ‘Object precedes verb’
Y = ‘Object-patterner precedes verb-patterner’
Logistic model: Y ~ X + (1 + X | family) + (1 + X | language)
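For illustration only (our own toy data; the talk's model additionally includes random slopes by family and language, which this sketch omits), the fixed-effect part can be fit like this:

```python
# Toy fixed-effects version of the significance test; the full analysis uses
# Y ~ X + (1+X|family) + (1+X|language) in a mixed-effects logistic model.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "X": [1, 1, 1, 1, 0, 0, 0, 0],   # object precedes verb (toy data, one row per grammar)
    "Y": [1, 1, 0, 1, 0, 1, 0, 0],   # object-patterner precedes verb-patterner
})
print(smf.logit("Y ~ X", data=df).fit().summary())  # look at the X coefficient
```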
Predictions largely complementary
Predictions mostly agree
Communicative Utility replicates predictions of Dependency Length Minimization.
Both measures predict most of the correlation universals.
Conclusion
● Tested explanations of Greenberg correlation universals in terms of efficiency of human processing and communication
● Using corpora from 51 languages, constructed counterfactual optimized languages
● Most of the correlations can be derived from pressure to shorten dependencies, decrease surprisal, or increase parsability
● Clear evidence for functional explanations of word order universals
Optimized grammars are easier to parse even when sentences are presented in orders very different from natural language
[Figure: parsing performance as a function of training data for a random grammar vs. an optimized grammar, with sentences presented in scrambled orders (e.g. ACEBD, ADBEC, ACEDB) rather than the natural order (ABCDE).]
Random grammars remain hard to parse even as training data increases.
Formalizing Parsability
Neural parser (Dozat and Manning 2017), illustrated on ‘Mary met John’:
1. A BiLSTM reads the sentence.
2. Heads are identified by computing a score for each pair of words.
Generic architecture, no assumptions beyond the sequential nature of the input.
Parsability is the information about the syntactic tree that can be extracted from the sentence.
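A minimal sketch of this kind of head-selection parser (our own simplification, not the Dozat & Manning implementation; dimensions, the bilinear scorer, and the toy word ids are assumptions):

```python
# Sketch: a BiLSTM reads the sentence, a bilinear layer scores every (dependent, head)
# pair, and parsability is the log-probability of the gold tree.
import torch
import torch.nn as nn

class TinyParser(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Bilinear(2 * hidden, 2 * hidden, 1)   # score(dependent, head)

    def head_log_probs(self, word_ids):
        """For each word, a log-distribution over candidate heads (index 0 = ROOT)."""
        states, _ = self.lstm(self.emb(word_ids))              # (1, n, 2*hidden)
        n = states.size(1)
        cands = torch.cat([torch.zeros(1, 1, states.size(2)), states], dim=1)
        dep = states.unsqueeze(2).expand(-1, n, n + 1, -1)
        head = cands.unsqueeze(1).expand(-1, n, n + 1, -1)
        scores = self.score(dep.reshape(-1, dep.size(-1)),
                            head.reshape(-1, head.size(-1))).view(1, n, n + 1)
        return torch.log_softmax(scores, dim=-1)

def parsability(model, word_ids, gold_heads):
    """log P(tree | sentence) = sum_i log P(head_i | sentence)."""
    logp = model.head_log_probs(word_ids)
    return sum(logp[0, i, h] for i, h in enumerate(gold_heads))

model = TinyParser()
sent = torch.tensor([[12, 45, 7]])                      # "Mary met John" as toy word ids
print(parsability(model, sent, gold_heads=[2, 0, 2]))   # Mary<-met, met<-ROOT, John<-met
```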
Formalizing Dependency Length
For a sentence w = w1...wn, dependency length is the distance between each word and its syntactic head, summed over all words in the sentence: DepLen(w) = Σi |i − head(i)|.
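A direct sketch of this computation (our own; the 1-based head-index representation is an assumption):

```python
# Dependency length: sum of |i - head(i)| over all words (ROOT attachments skipped).

def dependency_length(heads):
    """`heads` gives, for each word position (1-based), its head's position; 0 = ROOT."""
    return sum(abs(i - h) for i, h in enumerate(heads, start=1) if h != 0)

# "Mary has two green books": heads = [2, 0, 5, 5, 2]
print(dependency_length([2, 0, 5, 5, 2]))  # 1 + 2 + 1 + 3 = 7
```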
Formalizing Surprisal
Surprisal(w) = −Σi log P(wi | w1...wi−1): per-word surprisal, summed over all words in the sentence.
Surprisal depends on the probability model P, and the right choice of P depends on the entire language.
Given a word order grammar θ, we choose the model P that minimizes surprisal on the resulting sentences.
We use LSTM recurrent neural networks, the state of the art in probabilistic modelling of natural language and in predicting reading times. They are very general sequence models, arguably minimizing architectural biases.
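A minimal sketch of surprisal estimation with an LSTM language model (our own, not the talk's implementation; vocabulary size, dimensions, and the BOS convention are assumptions):

```python
# Sketch: computing -sum_i log P(w_i | w_1..w_{i-1}) with an (untrained) LSTM LM.
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, word_ids):
        states, _ = self.lstm(self.emb(word_ids))
        return torch.log_softmax(self.out(states), dim=-1)

def sentence_surprisal(model, word_ids):
    """Sum of per-word surprisals; position 0 is assumed to be a BOS symbol."""
    logp = model(word_ids)                                   # (1, n, vocab)
    targets = word_ids[:, 1:]                                # each next word
    token_logp = logp[:, :-1, :].gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return -token_logp.sum().item()                          # in nats

model = TinyLM()
print(sentence_surprisal(model, torch.tensor([[1, 12, 45, 7, 3]])))  # BOS + word ids
```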
Formalizing Informativity
Informativity: the information about the syntactic tree that can be extracted from the sentence.
We use a recent neural parser (Dozat and Manning 2017) with a generic architecture and state-of-the-art performance on many languages.
Word Order Grammars
For each dependency type, there are two parameters:
a. α: the probability that the dependent precedes the head
b. β: determines the distance from the head

Example α values: αverb-object = 0.1, αverb-subject = 0.95, αnoun-numeral = 0.99, αnoun-adjective = 0.8.
Example β values: βNoun-Adjective = −0.3, βNoun-Numeral = 0.8; softmax(βNoun-Adjective, βNoun-Numeral) ≈ (0.25, 0.75) gives the probabilities that the adjective or the numeral is placed first (here, farther from the noun).
Under these example parameters, the tree for ‘Mary has two green books’ is most likely linearized as ‘Mary has two green books’.
This specifies the space of possible grammars, within which we optimize.
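A minimal sketch of how such an (α, β) grammar could linearize an unordered tree (our own illustration; the tree representation, sampling scheme, and helper names are assumptions, not the talk's implementation):

```python
# Hypothetical linearization with an (alpha, beta) ordering grammar:
# alpha decides the side of the head, softmax over beta decides order within a side.
import math, random

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    return [e / sum(exps) for e in exps]

def order_side(deps, grammar):
    """Sample an order for the dependents on one side; higher beta tends to go first."""
    deps, ordered = list(deps), []
    while deps:
        probs = softmax([grammar[rel]["beta"] for rel, _ in deps])
        ordered.append(deps.pop(random.choices(range(len(deps)), weights=probs)[0]))
    return ordered

def linearize(word, children, grammar):
    """`children` is a list of (relation, (word, children)) pairs; returns a word list."""
    left, right = [], []
    for rel, child in children:
        side = left if random.random() < grammar[rel]["alpha"] else right
        side.append((rel, child))                 # alpha = P(dependent precedes head)
    out = []
    for _, (w, ch) in order_side(left, grammar):              # first placed = farthest left
        out += linearize(w, ch, grammar)
    out.append(word)
    for _, (w, ch) in reversed(order_side(right, grammar)):   # mirror on the right
        out += linearize(w, ch, grammar)
    return out

grammar = {"nsubj":  {"alpha": 0.95, "beta": 0.0},
           "obj":    {"alpha": 0.10, "beta": 0.0},
           "amod":   {"alpha": 0.80, "beta": -0.3},
           "nummod": {"alpha": 0.99, "beta": 0.8}}
books = ("books", [("nummod", ("two", [])), ("amod", ("green", []))])
tree = ("has", [("nsubj", ("Mary", [])), ("obj", books)])
print(" ".join(linearize(*tree, grammar)))   # most likely: "Mary has two green books"
```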
We work with trees in the Universal Dependencies format: ‘Mary has two green books’, with arcs nsubj(has → Mary), obj(has → books), nummod(books → two), amod(books → green).
To optimize grammars, we need a space of possible grammars.
SOV, SVO, VSO: SOV and VSO support the correlation; SVO does not (cf. Gibson et al. 2013 on pressures favouring SVO).
Dependency Length Minimization
Short syntactic dependencies ease processing (Gibson, 1998; Grodner and Gibson, 2005; Demberg and Keller, 2008; Bartek et al., 2011).
Quantitative corpus evidence from many languages confirms that languages have shorter dependencies than would be expected at random (Futrell et al., 2015).
Argued to explain several of the Greenberg correlations (Rijkhoff, 1986; Hawkins, 1994, 2003).
Two Objectives for Optimization
1. Dependency Length Minimization
2. Communicative Utility: Utility = Informativity − λ · Cost
Long tradition as an explanation of language (Gabelentz 1903, Zipf 1949, Horn 1984, …).
Informativity: the amount of meaning that can be extracted from the utterance, e.g. from ‘Mary has two green books.’
Informativity(utterance) := log P(tree | utterance) − log P(tree). We use a neural network model (Dozat and Manning 2017) with an extremely generic architecture.
Cost: the cost of processing the utterance.
Surprisal(wi | w1...wi−1) = −log P(wi | w1...wi−1). We use recurrent neural networks, the state of the art in probabilistic modelling of natural language and in predicting reading times.
(1) For each objective, find parameters that optimise it.
(2) Which universals do the resulting counterfactual languages satisfy?